Fall 2018, DSPA (HS650)

SID: 3911 UMich E-mail:

I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

Water Quality Prediction and Classification of Continental US lakes

Source: New York Times (2017)

Abstract

Natural environmental systems are highly complex, and lake ecosystems are a prime example. Given their invaluable importance, understanding these systems with the help of data-driven machine learning algorithms is important. A large volume of data is now available for processing and analysis, and data science tools can be applied to it to better understand lakes and their processes. I have applied different predictive models to assess the predictability of chlorophyll concentration in lakes. The trained and tested models produced good results, which can be used with some caution and further improved upon. In addition, I have analyzed the data using different clustering algorithms to characterize the types of lakes in the United States. The results indicate that there are at least two distinct clusters of lakes, and that clustering algorithms can also be used to classify lakes by their trophic state.

Motivation

Data science is an exciting and very powerful field of study, allowing us to mine the plethora of available data to understand phenomena that cannot be explored using traditional methods alone. This is especially true for environmental data: both natural and engineered systems are very complex and are typically handled with process-based mathematical and numerical models. Although these tools are powerful and have been used successfully, data science approaches can greatly reduce the effort needed to develop them and can complement them in areas where they fall short. To demonstrate this idea, I have chosen to apply different data science and predictive analytics methods to environmental data. Specifically, I will be working on water quality data for the continental US lakes.

About the data

The U.S. Environmental Protection Agency (USEPA) has implemented the National Lakes Assessment (NLA) project in partnership with states and tribes. The NLA is a national-scale sampling program designed to understand current conditions, track environmental improvements, and inform decisions at a larger scale 1 2. In this effort, the USEPA conducts field sampling for over 1000 lakes across the continental US; sampling has so far been done for the years 2007 and 2012. The sampling included lakes (lakes, reservoirs, and ponds) within the 48 contiguous United States with an area greater than 1 ha and a depth greater than 1 m. The resulting data comprise biological, chemical, and human land use features 3. The data are publicly available on the USEPA's website at the following address [https://www.epa.gov/national-aquatic-resource-surveys/nla]

Introduction

Lakes are inland water bodies that are one of the main sources of freshwater for human beings. Lakes have a variety of social and economic uses such as water supply, fishing, water transport, recreation, and tourism. A lake ecosystem is a very complex system comprising physical, chemical, and biological components. This complex system is highly dynamic and vulnerable over time under both natural and human influences. For these reasons, protecting our lakes is critical for their continued social, environmental, and economic benefits.

However, ever-increasing population and urbanization pose a serious problem to our lake systems, leading to the drying of lakes and their impairment with dangerous pollutants. One of the major problems in lake systems is eutrophication and the accompanying loss of biodiversity 4. Even though lakes naturally evolve and change their characteristics, this natural process can be accelerated or thrown out of balance by human influence. This occurs mainly when untreated wastewater is discharged directly into lakes or arrives as stormwater runoff from urban and agricultural areas. The wastewater changes the chemistry of the lakes, often causing nutrient enrichment; this phenomenon is called lake eutrophication 5 6. It can accelerate biological growth such as algal blooms. Therefore, monitoring and evaluating the condition of our lakes is critical for their sustainability.

One field of study that deals with the science of lakes and freshwater ecosystem dynamics is limnology. Key water quality parameters studied in limnology include phosphorus, nitrogen, transparency (Secchi depth), chlorophyll a, dissolved oxygen, and temperature 7. Of these, chlorophyll a, a green pigment, is directly linked with the growth of green algae, allowing them to photosynthesize 8. Hence, chlorophyll a is measured in lakes to determine how much algae is in the lake. Excessive growth of algae, such as harmful algal blooms (HABs), is becoming a global problem. One such recent incident, in Lake Erie during the summer of 2014, caused the shutdown of the water supply system in Toledo, Ohio 9. Therefore, predicting chlorophyll a concentration is critical for avoiding such impactful HAB occurrences and for informing policy decisions.
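To make the link between chlorophyll a and trophic state concrete, the classification can be sketched in R. The cutoffs below (in µg/L) are illustrative, commonly cited Carlson-style values, not necessarily the exact NLA criteria:

```r
# Classify lakes into trophic states from chlorophyll-a concentration (ug/L).
# Cutoffs are illustrative approximations; the NLA's own criteria may differ.
trophic_state <- function(chla) {
  cut(chla,
      breaks = c(-Inf, 2, 7, 30, Inf),
      labels = c("Oligotrophic", "Mesotrophic", "Eutrophic", "Hypereutrophic"))
}

trophic_state(c(1.5, 5.0, 12.0, 80.0))
# [1] Oligotrophic  Mesotrophic  Eutrophic  Hypereutrophic
```

The same function could be applied to the CHLA column of the assembled data frames to compare against the TSTATE_CHL labels already in the data.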

Conventionally, complex dynamic three-dimensional empirical and process-based numerical models are used to understand the processes in lakes, owing to their complexity and heterogeneity 10. These simulation models require large and diverse data in addition to computational power, and each lake is often treated separately because each lake is unique. However, there is a large volume of data available that has not been analyzed using current data science tools and methods. Therefore, the aim of this project is to demonstrate the use of current data science techniques for lake water quality modeling and assessment. To this end, I have processed and analyzed lake water quality data, focusing on model development for chlorophyll a concentration and unsupervised classification of US lakes.

Research questions

In general, I want to demonstrate the power of data science tools and techniques to process, visualize, and perform an in-depth analysis of data to gain insights into and understanding of the system. Specifically, the main questions I would like to answer in this project include the following:

  • Can we reliably predict chlorophyll-a concentration in lakes?
  • Which features are important for chlorophyll-a prediction?
  • What are the different clusters of US lakes?
    • Is there any significant difference between urban and rural lakes, and between man-made and natural lakes?


Load the data

The EPA provides the lake water quality data for free on its website at [https://www.epa.gov/national-aquatic-resource-surveys/nla]. I have downloaded the csv files from this website and stored them on my local hard disk. In this section I will import several of these datasets, which will be used for the analysis in this project.

Load 2007 data

library(dplyr)
library(reshape2)

# Read land use metrics
lakes_lu07 <- read.csv("data/Lakes/final/nla2007_basin_landuse_metrics.csv", header = T, stringsAsFactors = F)
dim(lakes_lu07)
## [1] 1157   59
# Read lakes dissolved oxygen data
lakes_do07 <- read.csv("data/Lakes/final/nla2007_meando_cond.csv", header = T, stringsAsFactors = F)
dim(lakes_do07)
## [1] 1252   17
# Read lakes Physical habitat condition data
lakes_phy07 <- read.csv("data/Lakes/final/nla2007_phab_indexvalues.csv", header = T, stringsAsFactors = F)
dim(lakes_phy07)
## [1] 1442   45
# Read sampled lakes information
lakes_info07 <- read.csv("data/Lakes/final/nla2007_sampledlakeinfo.csv", header = T, stringsAsFactors = F)
dim(lakes_info07)
## [1] 1252   47
# Read lakes secchi depth data
lakes_secchi07 <- read.csv("data/Lakes/final/nla2007_secchi.csv", header = T, stringsAsFactors = F)
dim(lakes_secchi07)
## [1] 1252    3
# Read lakes trophic level estimate data
lakes_troph07 <- read.csv("data/Lakes/final/nla2007_trophic_conditionestimate.csv", header = T, stringsAsFactors = F)
dim(lakes_troph07)
## [1] 1252   12
# Read lakes water chemistry data
lakes_chem07 <- read.csv("data/Lakes/final/EPA_NLA_CHEM07.csv", header = T, stringsAsFactors = F)
dim(lakes_chem07)
## [1] 1326   28

Next, let us join the data frames into one data frame using the full_join function from the dplyr package, matching by their site ID.

lakes.07 <- full_join(lakes_chem07, lakes_do07) %>% full_join(lakes_info07) %>%
  full_join(lakes_phy07) %>% full_join(lakes_secchi07) %>% full_join(lakes_troph07)

dim(lakes.07)
## [1] 3652  120
length(unique(lakes.07$SITE_ID))
## [1] 1157
#write.csv(lakes.07, "lakes.07.csv")
names(lakes.07)
##   [1] "SITE_ID"         "DATE_COL"        "DATECHEM"       
##   [4] "PH_FIELD"        "PH_LAB"          "COND"           
##   [7] "ANC"             "TURB"            "TOC"            
##  [10] "DOC"             "NH4N_PPM"        "NO3_NO2"        
##  [13] "NTL_PPM"         "PTL"             "CL_PPM"         
##  [16] "NO3N_PPM"        "SO4_PPM"         "CA_PPM"         
##  [19] "MG_PPM"          "NA_PPM"          "K_PPM"          
##  [22] "COLOR"           "SIO2"            "NH4ION"         
##  [25] "CATSUM"          "ANSUM2"          "CHLA"           
##  [28] "SECMEAN"         "VISIT_NO"        "LAT_DD"         
##  [31] "LON_DD"          "ST"              "EPA_REG"        
##  [34] "URBAN"           "NUT_REG"         "NUTREG_NAME"    
##  [37] "ECO_NUTA"        "LAKE_ORIGIN"     "ECO3_X_ORIGIN"  
##  [40] "REF_CLUSTER"     "RT_NLA"          "HUC_2"          
##  [43] "HUC_8"           "DO2_2M"          "SAMPLED"        
##  [46] "REPEAT"          "ALBERS_X"        "ALBERS_Y"       
##  [49] "FLD_LON_DD"      "FLD_LAT_DD"      "STATE_NAME"     
##  [52] "CNTYNAME"        "NHDNAME"         "LAKENAME"       
##  [55] "MDCATY"          "WGT"             "WGT_NLA"        
##  [58] "ADJWGT_CAT"      "WSA_ECO3"        "WSA_ECO9"       
##  [61] "ECO_LEV_3"       "ECO_L3_NAM"      "REFCLUS_NAME"   
##  [64] "REF_NUTR"        "AREA_HA"         "SIZE_CLASS"     
##  [67] "LAKEAREA"        "LAKEPERIM"       "SLD"            
##  [70] "DEPTH_X"         "DEPTHMAX"        "ELEV_PT"        
##  [73] "COM_ID"          "REACHCODE"       "DATEPHAB"       
##  [76] "CLASSP"          "EcoP6"           "ssiBedBld"      
##  [79] "ssiNATBedBld"    "rviLowWood"      "RVegQ_7"        
##  [82] "RVegQ_8"         "L_RVegQ_8"       "LRCVQ_7A"       
##  [85] "LRCVQ_7B"        "LRCVQ_7C"        "LRCVQ_7D"       
##  [88] "LRCVQ_8D"        "L_LRCVQ_8D"      "ElevXLat"       
##  [91] "ElevDLat"        "ElevXLon"        "XER"            
##  [94] "XER_X_Elev"      "WMT"             "RDisIXAgAdj5"   
##  [97] "RDisIXAgAdj4"    "RDisIXAgAdj3"    "RDisIXAgAdj2"   
## [100] "RDis_InEx"       "RDis_IX"         "RvegQ_Var"      
## [103] "RVegQ"           "LogRVegQ"        "LitCvrQ_Var"    
## [106] "LitCvrQ"         "LogLitCvrQ"      "LitRipCVQ_Var"  
## [109] "LitRipCVQ"       "LogLitRipCvQ"    "RVeg_OE"        
## [112] "LitCvr_OE"       "LitRipCvr_OE"    "DATE_SECCHI"    
## [115] "NTL"             "CLEAR_TO_BOTTOM" "TSTATE_TP"      
## [118] "TSTATE_TN"       "TSTATE_CHL"      "TSTATE_SECCHI"

This data has 3652 observations and 120 features, but only 1157 unique site IDs. This implies that some of the lakes were sampled once while others were sampled more than once in 2007. Thus, to be consistent and simplify the analysis, we can take the average for those lakes with more than one sampling.

# Aggregate only the numeric features by site id
lakes.07.agg <- lakes.07 %>% group_by(SITE_ID) %>% summarise_if(is.numeric, mean, na.rm=T) %>% ungroup()

# Aggregate the character features
lakes.07.char <- lakes.07 %>% group_by(SITE_ID) %>% summarise_if(is.character, max)

# Check if the two dataframes match
length(intersect(lakes.07.agg$SITE_ID, lakes.07.char$SITE_ID))
## [1] 1157
# combine the two data frames
lakes.07.df <- data.frame(left_join(lakes.07.agg, lakes.07.char))
## Joining, by = "SITE_ID"
dim(lakes.07.df)
## [1] 1157  120
length(unique(lakes.07.df$SITE_ID))
## [1] 1157

Load 2012 data

In this section we will load the data for 2012 and, similarly to 2007, aggregate the different datasets into one data frame.

library(dplyr)
# Read lakes water chemistry data
lakes_chem12 <- read.csv("data/Lakes/final/nla2012_waterchem_wide.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_chem12)
## [1] 1230   21
# Read lakes water chemistry profile data
lakes_prof12 <- read.csv("data/Lakes/final/nla2012_wide_profile.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_prof12)
## [1] 30717    56
# Aggregate profile data using the average

lakes_prof_agg <- aggregate(cbind(OXYGEN, PH, TEMPERATURE, DEPTH, CONDUCTIVITY) ~ UID, FUN = mean, data=lakes_prof12)
lakes_prof_agg2 <- aggregate(cbind(INDEX_LAT_DD, INDEX_LON_DD) ~ UID, FUN = mean, data=lakes_prof12)


# Lakes site information
lakes_info12 <- read.csv("data/Lakes/final/nla2012_wide_siteinfo.csv", 
                         header = T, stringsAsFactors = F)

# Read lakes secchi depth
lakes_secch12 <- read.csv("data/Lakes/final/nla2012_secchi.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_secch12)
## [1] 1221   13
# Read lakes chlorophyll data
lakes_chla12 <- read.csv("data/Lakes/final/nla2012_chla_wide.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_chla12)
## [1] 1230    7
# Read key variables data
lakes_key12 <- read.csv("data/Lakes/final/nla12_keyvariables_data.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_key12)
## [1] 1138   50
lakes_key12small <- select(lakes_key12, UID,SITE_ID, MMI_BENT_NLA12, MMI_ZOOP_NLA6, AMITOTAL, LitCvrQc3OE, LitRipCvrQc3OE, RDis_IX, RVegQc3OE, MICX_RESULT, ATRAZINE_RESULT, TOTALHG_RESULT, METHYLMERCURY_RESULT)

# Read lakes condition data
lakes_cond12 <- read.csv("data/Lakes/final/nla_2012_condition_categories.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_cond12)
## [1] 1038   51
# Read lakes phytoplankton count data
lakes_phyto12 <- read.csv("data/Lakes/final/nla2012_wide_phytoplankton_count.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_phyto12)
## [1] 38627    29
# Read lakes physical habitat condition data
lakes_phab12 <- read.csv("data/Lakes/final/nla2012_wide_phab.csv", 
                         header = T, stringsAsFactors = F)
dim(lakes_phab12)
## [1] 33187   115
lakes_basinarea <- select(lakes_lu07, SITE_ID, BASINAREA_HA)
names(lakes_basinarea)[1] <- "SITEID_07"

# Merge all the data frames into one


lakes_data <- left_join(lakes_chem12,lakes_chla12, by="UID")
lakes_data <- left_join(lakes_data, lakes_cond12, by="UID")
lakes_data <- left_join(lakes_data, lakes_info12, by="UID")
lakes_data <- left_join(lakes_data, lakes_prof_agg, by="UID")
lakes_data <- left_join(lakes_data, lakes_prof_agg2, by="UID")
lakes_data <- left_join(lakes_data, lakes_secch12, by="UID")
lakes_data <- left_join(lakes_data, lakes_key12, by="UID")
#lakes_data <- left_join(lakes_data, lakes_basinarea, by="SITEID_07")

lakes_phyto <- aggregate(ABUNDANCE ~ UID, FUN = sum, data=lakes_phyto12)
lakes_data <- left_join(lakes_data, lakes_phyto)
## Joining, by = "UID"
# Remove repeated columns
lakes_data <- select(lakes_data, -ends_with(".y"))
dim(lakes_data)
## [1] 1230  170
#Clean column names
names(lakes_data) <- gsub(".x|_RESULT", "", names(lakes_data))

# Select key features
select_lakes <- subset(lakes_data, select=c(UID:TURB, CHLX, CHLL, OXYGEN, PH, TEMPERATURE, DEPTH, CONDUCTIVITY,SECCHI, AREA_HA, ELEVATION,ABUNDANCE,PERIM_KM, INDEX_LAT_DD, INDEX_LON_DD, MMI_BENT_NLA12, MMI_ZOOP_NLA6, AMITOTAL, LitCvrQc3OE, LitRipCvrQc3OE, RDis_IX, RVegQc3OE, MICX, ATRAZINE, METHYLMERCURY, TOTALHG,  URBAN, LAKE_ORIGIN))

select_lakes$URBAN <- as.factor(select_lakes$URBAN)
select_lakes$LAKE_ORIGIN <- as.factor(select_lakes$LAKE_ORIGIN)

dim(select_lakes)
## [1] 1230   48

Summaries of the data

First, I will define a custom function that calculates basic summaries of the features and returns them in a data frame. This function will then be used to compute summary statistics on the data.

# Summary function
# Calculate summary statistics

summaries <- function(x){
  x_summary <- matrix(0, ncol(x),5)
  for (i in 1:ncol(x)) {
    x_summary[i,] <- c(min(x[,i], na.rm=T), 
                       max(x[,i], na.rm = T), 
                       sd(x[,i], na.rm = T), 
                       mean(x[,i], na.rm=T), 
                       sum(is.na(x[,i])) )
   }

  x_summary <- data.frame(round(x_summary,3), row.names = names(x))
  names(x_summary) <- c("min", "max", "sd", "mean", "NAs")
  return(x_summary)
  
}

Summary of 2012 data

We can use the function defined above to compute summary statistics on the aggregated data for the year 2012.

library(kableExtra)
# Drop TSS, which is mostly missing, and the repeated variable PH.1
select_lakes <- select(select_lakes, -TSS, -PH.1)

lakes_summ <- summaries(select_lakes[,2:44])

lakes_summ %>% kable(digits = 3)%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
min max sd mean NAs
ALUMINUM 0.000 3.793 0.145 0.023 168
AMMONIA_N 0.000 3.180 0.132 0.037 1
ANC -3361.400 203857.000 10133.962 2696.844 0
CALCIUM 0.121 594.900 45.795 28.325 0
CHLORIDE 0.036 18012.742 555.925 53.542 0
COLOR 0.000 840.000 39.935 25.315 1
COND 2.820 64810.000 3162.987 688.356 0
DOC 0.230 515.810 18.544 8.287 0
MAGNESIUM 0.067 1023.000 68.048 24.561 0
NITRATE_N 0.000 51.660 1.555 0.108 52
NITRITE_N 0.000 0.419 0.013 0.001 168
NTL 0.014 54.000 2.493 1.123 0
PH 2.830 10.470 0.845 8.072 0
POTASSIUM 0.041 3376.000 121.946 11.799 2
PTL 4.000 3636.000 269.738 114.750 0
SILICA 0.022 935.000 29.725 10.882 0
SODIUM 0.093 29890.000 1093.407 109.473 1
SULFATE 0.026 47325.202 1681.334 194.297 0
TURB 0.010 447.150 27.294 9.451 41
CHLX 0.000 764.640 54.082 26.118 5
CHLL 0.000 584.640 58.070 27.332 9
OXYGEN 0.250 23.686 2.447 6.751 170
TEMPERATURE 6.659 335.001 11.639 21.765 170
DEPTH 0.000 25.000 3.640 3.264 170
CONDUCTIVITY 1.780 68863.636 3828.352 773.638 170
SECCHI 0.020 28.000 2.311 2.134 75
AREA_HA 1.033 167489.609 7501.797 861.475 100
ELEVATION -53.270 3594.970 758.063 646.201 100
ABUNDANCE 267.000 67860.000 5494.833 3685.472 0
PERIM_KM 0.394 3520.774 125.610 21.095 100
INDEX_LAT_DD 26.070 48.985 4.950 40.760 0
INDEX_LON_DD -124.230 -67.194 14.842 -95.078 0
MMI_BENT_NLA12 0.000 86.100 15.820 43.188 222
MMI_ZOOP_NLA6 0.050 94.650 17.583 55.213 96
AMITOTAL 0.000 2.045 0.353 0.360 117
LitCvrQc3OE 0.000 5.636 0.717 0.881 117
LitRipCvrQc3OE 0.000 4.696 0.469 0.735 119
RDis_IX 0.000 0.950 0.285 0.488 124
RVegQc3OE 0.000 4.532 0.508 0.701 119
MICX 0.000 66.690 3.535 0.669 92
ATRAZINE 0.000 9.700 0.486 0.127 94
METHYLMERCURY 0.080 19.000 1.568 1.218 222
TOTALHG 4.840 859.730 93.894 103.161 227

As can be seen from the above result, we have some missing values, and two of the features, ELEVATION and ANC (acid neutralizing capacity), have some negative values. In the next section we will try to impute the missing values using the mice and mi packages.
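A quick sanity check for negative values per feature can be sketched as follows. The toy data frame below simply reuses the minima reported in the summary table; in the actual analysis the same sapply() call would be run on select_lakes:

```r
# Count negative values in each numeric column (toy data for illustration;
# apply the same call to select_lakes to reproduce the check on the real data)
toy <- data.frame(ANC       = c(-3361.4, 10.2, 5.0),
                  ELEVATION = c(-53.27, 120.5, 640.0),
                  PH        = c(7.1, 8.0, 6.5))
neg_counts <- sapply(toy, function(v) sum(v < 0, na.rm = TRUE))
neg_counts
# ANC = 1, ELEVATION = 1, PH = 0
```

Note that negative ANC values are physically meaningful (net acidity), whereas negative elevations can occur for lakes below sea level, so neither necessarily indicates a data error.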

Summary of 2007 data

We can use the function defined above to compute summary statistics on the aggregated data for the year 2007. The summary is done on the numeric features only.

# Summary for the numeric features
lakes07_summ <- summaries(select_if(lakes.07.df, is.numeric))

lakes07_summ %>% kable(digits = 3)%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
min max sd mean NAs
PH_FIELD 4.100 1.030000e+01 8.990000e-01 8.080000e+00 9
PH_LAB 4.160 1.009000e+01 7.650000e-01 8.014000e+00 0
COND 4.350 5.059000e+04 2.470393e+03 6.652670e+02 0
ANC -62.960 9.163290e+04 5.830707e+03 2.620119e+03 0
TURB 0.147 5.740000e+02 3.413200e+01 1.291600e+01 0
TOC 0.370 3.250500e+02 2.002100e+01 1.001200e+01 0
DOC 0.340 2.905700e+02 1.690600e+01 8.878000e+00 0
NH4N_PPM 0.004 1.547000e+00 1.100000e-01 4.200000e-02 0
NO3_NO2 0.002 5.592000e+00 3.330000e-01 6.900000e-02 0
NTL_PPM 0.010 2.610000e+01 2.154000e+00 1.167000e+00 0
PTL 0.714 4.753400e+03 2.698330e+02 1.073340e+02 0
CL_PPM 0.050 1.581431e+04 5.251390e+02 5.672500e+01 0
NO3N_PPM 0.002 5.409000e+00 3.080000e-01 7.300000e-02 0
SO4_PPM 0.083 4.008575e+04 1.528241e+03 1.887880e+02 0
CA_PPM 0.205 4.857000e+02 3.698100e+01 2.782800e+01 0
MG_PPM 0.062 2.466000e+03 1.165320e+02 2.706700e+01 0
NA_PPM 0.193 1.674000e+04 6.380970e+02 8.697200e+01 0
K_PPM 0.050 1.412000e+03 5.395700e+01 9.862000e+00 0
COLOR 0.000 1.650000e+02 1.606700e+01 1.618900e+01 0
SIO2 0.025 9.190700e+01 1.061900e+01 8.783000e+00 0
NH4ION 0.000 9.496400e+01 6.522000e+00 2.561000e+00 0
CATSUM 28.670 9.415390e+05 3.757639e+04 7.653919e+03 0
ANSUM2 32.840 9.632158e+05 4.069073e+04 8.155782e+03 0
CHLA 0.068 8.712000e+02 6.470200e+01 2.852400e+01 5
SECMEAN 0.040 3.671000e+01 2.500000e+00 2.186000e+00 65
VISIT_NO 1.000 1.556000e+00 1.380000e-01 1.041000e+00 0
LAT_DD 26.936 4.897900e+01 5.036000e+00 4.062000e+01 0
LON_DD -124.633 -6.769900e+01 1.422700e+01 -9.465800e+01 0
HUC_2 1.000 1.800000e+01 5.011000e+00 8.455000e+00 0
HUC_8 1010002.000 1.810020e+07 5.016659e+06 8.525925e+06 0
DO2_2M 0.800 2.100000e+01 1.863000e+00 7.845000e+00 55
ALBERS_X -2292120.601 2.203317e+06 1.162261e+06 1.245475e+05 0
ALBERS_Y -1055938.590 1.524035e+06 5.888666e+05 4.369279e+05 0
FLD_LON_DD -124.621 -6.769900e+01 1.423400e+01 -9.465500e+01 1
FLD_LAT_DD 26.802 4.898300e+01 5.034000e+00 4.061600e+01 1
MDCATY 0.000 7.600000e-01 6.400000e-02 5.500000e-02 0
WGT 0.000 7.703160e+02 1.781560e+02 8.549100e+01 0
WGT_NLA 0.000 7.248870e+02 9.154900e+01 4.304700e+01 0
ECO_LEV_3 1.000 8.400000e+01 2.032700e+01 4.415300e+01 0
AREA_HA 4.037 1.674896e+05 8.020313e+03 1.230252e+03 0
LAKEAREA 0.040 1.674896e+03 8.020300e+01 1.230300e+01 0
LAKEPERIM 0.780 2.763418e+03 1.381510e+02 3.299900e+01 0
SLD 1.019 2.863000e+01 2.588000e+00 2.644000e+00 0
DEPTH_X 0.500 9.700000e+01 1.021300e+01 9.449000e+00 17
DEPTHMAX 0.500 9.700000e+01 1.026900e+01 9.494000e+00 1
ELEV_PT 0.000 3.403000e+03 6.788460e+02 6.076520e+02 0
COM_ID 0.000 2.453966e+07 7.235670e+06 1.184537e+07 0
REACHCODE 2060003.000 1.810020e+13 5.047690e+12 8.338619e+12 1
ssiBedBld 0.000 9.640000e-01 1.310000e-01 9.200000e-02 57
ssiNATBedBld 0.000 9.640000e-01 1.080000e-01 4.400000e-02 57
rviLowWood 0.000 1.702000e+00 3.270000e-01 4.210000e-01 59
RVegQ_7 0.000 5.520000e-01 1.070000e-01 1.550000e-01 59
RVegQ_8 0.000 5.750000e-01 1.270000e-01 2.160000e-01 68
L_RVegQ_8 -2.000 -2.330000e-01 4.110000e-01 -7.750000e-01 68
LRCVQ_7A 0.000 9.500000e-01 1.700000e-01 3.120000e-01 60
LRCVQ_7B 0.000 7.060000e-01 1.150000e-01 2.090000e-01 62
LRCVQ_7C 0.000 5.840000e-01 1.030000e-01 1.830000e-01 62
LRCVQ_7D 0.000 3.980000e-01 7.800000e-02 1.250000e-01 62
LRCVQ_8D 0.000 4.450000e-01 8.700000e-02 1.560000e-01 72
L_LRCVQ_8D -2.000 -3.420000e-01 3.150000e-01 -8.660000e-01 72
ElevXLat 0.000 1.412006e+05 2.799303e+04 2.570500e+04 57
ElevDLat 0.000 8.571300e+01 1.707300e+01 1.509200e+01 57
ElevXLon -375708.756 0.000000e+00 7.751939e+04 -6.445280e+04 57
XER 0.000 1.000000e+00 2.660000e-01 7.600000e-02 56
XER_X_Elev 0.000 2.968800e+03 3.982880e+02 9.950400e+01 56
WMT 0.000 1.000000e+00 3.530000e-01 1.450000e-01 56
RDisIXAgAdj5 0.000 9.450000e-01 2.760000e-01 4.740000e-01 57
RDisIXAgAdj4 0.000 9.410000e-01 2.740000e-01 4.690000e-01 57
RDisIXAgAdj3 0.000 9.360000e-01 2.720000e-01 4.640000e-01 57
RDisIXAgAdj2 0.000 9.320000e-01 2.700000e-01 4.560000e-01 57
RDis_InEx 0.000 9.320000e-01 2.670000e-01 4.440000e-01 57
RDis_IX 0.000 9.450000e-01 2.760000e-01 4.740000e-01 57
RVegQ 0.000 5.580000e-01 1.130000e-01 1.710000e-01 60
LogRVegQ -2.000 -2.460000e-01 3.810000e-01 -8.660000e-01 60
LitCvrQ 0.000 1.105000e+00 1.120000e-01 1.220000e-01 60
LogLitCvrQ -2.000 4.700000e-02 3.780000e-01 -1.029000e+00 60
LitRipCVQ 0.000 5.880000e-01 9.400000e-02 1.470000e-01 63
LogLitRipCvQ -2.000 -2.230000e-01 3.200000e-01 -9.020000e-01 63
RVeg_OE 0.000 3.142000e+00 4.760000e-01 7.590000e-01 60
LitCvr_OE 0.000 5.099000e+00 6.760000e-01 8.590000e-01 60
LitRipCvr_OE 0.000 3.041000e+00 4.380000e-01 7.620000e-01 63
NTL 5.000 2.610000e+04 2.152594e+03 1.166017e+03 0
summary(lakes.07.df$URBAN)
##    Length     Class      Mode 
##      1157 character character

Compared to the 2012 data, the 2007 data has fewer missing values. In the next section, I will use the mice and mi packages to impute the missing data.
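The section below carries out the imputation with the mi package. Since mice is also mentioned, a minimal mice-based sketch on a small synthetic data frame might look like the following (illustrative only; variable names and settings are assumptions, not part of the original analysis):

```r
# Hedged sketch of multiple imputation with the mice package,
# shown on a tiny synthetic data frame with missing values.
library(mice)

set.seed(1234)
df <- data.frame(x = c(1.2, NA, 3.4, 5.6, NA, 2.1, 4.4, 3.3),
                 y = c(2.0, 3.1, NA, 6.0, 4.2, 2.5, NA, 3.8))

# Create m = 5 imputed datasets using predictive mean matching (pmm)
imp <- mice(df, m = 5, method = "pmm", printFlag = FALSE)

# Extract the first completed dataset; no NAs should remain
df_complete <- complete(imp, action = 1)
anyNA(df_complete)
```

For the real analysis one would pass the numeric lake features instead of df and could pool model results across all m imputations rather than using a single completed dataset.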

Handle missing data

library("betareg")
library("mi")

# Show the missing-data information
mdf <- missing_data.frame(select_lakes[,-1]) 
## NOTE: The following pairs of variables appear to have the same missingness pattern.
##  Please verify whether they are in fact logically distinct variables.
##      [,1]             [,2]       
## [1,] "ALUMINUM"       "NITRITE_N"
## [2,] "AREA_HA"        "ELEVATION"
## [3,] "AREA_HA"        "PERIM_KM" 
## [4,] "AREA_HA"        "URBAN"    
## [5,] "ELEVATION"      "PERIM_KM" 
## [6,] "ELEVATION"      "URBAN"    
## [7,] "PERIM_KM"       "URBAN"    
## [8,] "LitRipCvrQc3OE" "RVegQc3OE"
show(mdf)
## Object of class missing_data.frame with 1230 observations on 45 variables
## 
## There are 83 missing data patterns
## 
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
## 
##                            type missing method   model
## ALUMINUM             continuous     168    ppd  linear
## AMMONIA_N            continuous       1    ppd  linear
## ANC                  continuous       0   <NA>    <NA>
## CALCIUM              continuous       0   <NA>    <NA>
## CHLORIDE             continuous       0   <NA>    <NA>
## COLOR                continuous       1    ppd  linear
## COND                 continuous       0   <NA>    <NA>
## DOC                  continuous       0   <NA>    <NA>
## MAGNESIUM            continuous       0   <NA>    <NA>
## NITRATE_N            continuous      52    ppd  linear
## NITRITE_N         SC_proportion     168    ppd betareg
## NITRITE_N:is_zero        binary     168    ppd   logit
## NTL                  continuous       0   <NA>    <NA>
## PH                   continuous       0   <NA>    <NA>
## POTASSIUM            continuous       2    ppd  linear
## PTL                  continuous       0   <NA>    <NA>
## SILICA               continuous       0   <NA>    <NA>
## SODIUM               continuous       1    ppd  linear
## SULFATE              continuous       0   <NA>    <NA>
## TURB                 continuous      41    ppd  linear
## CHLX                 continuous       5    ppd  linear
## CHLL                 continuous       9    ppd  linear
## OXYGEN               continuous     170    ppd  linear
## TEMPERATURE          continuous     170    ppd  linear
## DEPTH                continuous     170    ppd  linear
## CONDUCTIVITY         continuous     170    ppd  linear
## SECCHI               continuous      75    ppd  linear
## AREA_HA              continuous     100    ppd  linear
## ELEVATION            continuous     100    ppd  linear
## ABUNDANCE            continuous       0   <NA>    <NA>
## PERIM_KM             continuous     100    ppd  linear
## INDEX_LAT_DD         continuous       0   <NA>    <NA>
## INDEX_LON_DD         continuous       0   <NA>    <NA>
## MMI_BENT_NLA12       continuous     222    ppd  linear
## MMI_ZOOP_NLA6        continuous      96    ppd  linear
## AMITOTAL             continuous     117    ppd  linear
## LitCvrQc3OE          continuous     117    ppd  linear
## LitRipCvrQc3OE       continuous     119    ppd  linear
## RDis_IX           SC_proportion     124    ppd betareg
## RDis_IX:is_zero          binary     124    ppd   logit
## RVegQc3OE            continuous     119    ppd  linear
## MICX                 continuous      92    ppd  linear
## ATRAZINE             continuous      94    ppd  linear
## METHYLMERCURY        continuous     222    ppd  linear
## TOTALHG              continuous     227    ppd  linear
## URBAN                    binary     100    ppd   logit
## LAKE_ORIGIN              binary     192    ppd   logit
## 
##                     family     link transformation
## ALUMINUM          gaussian identity    standardize
## AMMONIA_N         gaussian identity    standardize
## ANC                   <NA>     <NA>    standardize
## CALCIUM               <NA>     <NA>    standardize
## CHLORIDE              <NA>     <NA>    standardize
## COLOR             gaussian identity    standardize
## COND                  <NA>     <NA>    standardize
## DOC                   <NA>     <NA>    standardize
## MAGNESIUM             <NA>     <NA>    standardize
## NITRATE_N         gaussian identity    standardize
## NITRITE_N         binomial    logit        squeeze
## NITRITE_N:is_zero binomial    logit           <NA>
## NTL                   <NA>     <NA>    standardize
## PH                    <NA>     <NA>    standardize
## POTASSIUM         gaussian identity    standardize
## PTL                   <NA>     <NA>    standardize
## SILICA                <NA>     <NA>    standardize
## SODIUM            gaussian identity    standardize
## SULFATE               <NA>     <NA>    standardize
## TURB              gaussian identity    standardize
## CHLX              gaussian identity    standardize
## CHLL              gaussian identity    standardize
## OXYGEN            gaussian identity    standardize
## TEMPERATURE       gaussian identity    standardize
## DEPTH             gaussian identity    standardize
## CONDUCTIVITY      gaussian identity    standardize
## SECCHI            gaussian identity    standardize
## AREA_HA           gaussian identity    standardize
## ELEVATION         gaussian identity    standardize
## ABUNDANCE             <NA>     <NA>    standardize
## PERIM_KM          gaussian identity    standardize
## INDEX_LAT_DD          <NA>     <NA>    standardize
## INDEX_LON_DD          <NA>     <NA>    standardize
## MMI_BENT_NLA12    gaussian identity    standardize
## MMI_ZOOP_NLA6     gaussian identity    standardize
## AMITOTAL          gaussian identity    standardize
## LitCvrQc3OE       gaussian identity    standardize
## LitRipCvrQc3OE    gaussian identity    standardize
## RDis_IX           binomial    logit        squeeze
## RDis_IX:is_zero   binomial    logit           <NA>
## RVegQc3OE         gaussian identity    standardize
## MICX              gaussian identity    standardize
## ATRAZINE          gaussian identity    standardize
## METHYLMERCURY     gaussian identity    standardize
## TOTALHG           gaussian identity    standardize
## URBAN             binomial    logit           <NA>
## LAKE_ORIGIN       binomial    logit           <NA>
image(mdf)

The observed missingness pattern is not random; it arises because the data come from different source data frames. In particular, data points related to the physical characteristics of a lake, such as elevation, latitude, and longitude, should not be imputed, so we can drop these missing rows.

Start imputation for 2012 data

Now we can start the imputation process for the remaining features.

select_lakes2 <- subset(select_lakes, !is.na(ELEVATION))
lakes_summ <- summaries(select_lakes2[,2:44])
mdf2 <- missing_data.frame(select_lakes2[,-1]) 
## NOTE: The following pairs of variables appear to have the same missingness pattern.
##  Please verify whether they are in fact logically distinct variables.
##      [,1]             [,2]       
## [1,] "ALUMINUM"       "NITRITE_N"
## [2,] "LitRipCvrQc3OE" "RVegQc3OE"
show(mdf2)
## Object of class missing_data.frame with 1130 observations on 45 variables
## 
## There are 64 missing data patterns
## 
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
## 
##                            type missing method   model
## ALUMINUM             continuous     158    ppd  linear
## AMMONIA_N            continuous       1    ppd  linear
## ANC                  continuous       0   <NA>    <NA>
## CALCIUM              continuous       0   <NA>    <NA>
## CHLORIDE             continuous       0   <NA>    <NA>
## COLOR                continuous       1    ppd  linear
## COND                 continuous       0   <NA>    <NA>
## DOC                  continuous       0   <NA>    <NA>
## MAGNESIUM            continuous       0   <NA>    <NA>
## NITRATE_N            continuous      50    ppd  linear
## NITRITE_N         SC_proportion     158    ppd betareg
## NITRITE_N:is_zero        binary     158    ppd   logit
## NTL                  continuous       0   <NA>    <NA>
## PH                   continuous       0   <NA>    <NA>
## POTASSIUM            continuous       2    ppd  linear
## PTL                  continuous       0   <NA>    <NA>
## SILICA               continuous       0   <NA>    <NA>
## SODIUM               continuous       1    ppd  linear
## SULFATE              continuous       0   <NA>    <NA>
## TURB                 continuous      39    ppd  linear
## CHLX                 continuous       5    ppd  linear
## CHLL                 continuous       9    ppd  linear
## OXYGEN               continuous     155    ppd  linear
## TEMPERATURE          continuous     155    ppd  linear
## DEPTH                continuous     155    ppd  linear
## CONDUCTIVITY         continuous     155    ppd  linear
## SECCHI               continuous      67    ppd  linear
## AREA_HA              continuous       0   <NA>    <NA>
## ELEVATION            continuous       0   <NA>    <NA>
## ABUNDANCE            continuous       0   <NA>    <NA>
## PERIM_KM             continuous       0   <NA>    <NA>
## INDEX_LAT_DD         continuous       0   <NA>    <NA>
## INDEX_LON_DD         continuous       0   <NA>    <NA>
## MMI_BENT_NLA12       continuous     208    ppd  linear
## MMI_ZOOP_NLA6        continuous      95    ppd  linear
## AMITOTAL             continuous     114    ppd  linear
## LitCvrQc3OE          continuous     113    ppd  linear
## LitRipCvrQc3OE       continuous     114    ppd  linear
## RDis_IX           SC_proportion     120    ppd betareg
## RDis_IX:is_zero          binary     120    ppd   logit
## RVegQc3OE            continuous     114    ppd  linear
## MICX                 continuous      92    ppd  linear
## ATRAZINE             continuous      94    ppd  linear
## METHYLMERCURY        continuous     130    ppd  linear
## TOTALHG              continuous     135    ppd  linear
## URBAN                    binary       0   <NA>    <NA>
## LAKE_ORIGIN              binary     100    ppd   logit
## 
##                     family     link transformation
## ALUMINUM          gaussian identity    standardize
## AMMONIA_N         gaussian identity    standardize
## ANC                   <NA>     <NA>    standardize
## CALCIUM               <NA>     <NA>    standardize
## CHLORIDE              <NA>     <NA>    standardize
## COLOR             gaussian identity    standardize
## COND                  <NA>     <NA>    standardize
## DOC                   <NA>     <NA>    standardize
## MAGNESIUM             <NA>     <NA>    standardize
## NITRATE_N         gaussian identity    standardize
## NITRITE_N         binomial    logit        squeeze
## NITRITE_N:is_zero binomial    logit           <NA>
## NTL                   <NA>     <NA>    standardize
## PH                    <NA>     <NA>    standardize
## POTASSIUM         gaussian identity    standardize
## PTL                   <NA>     <NA>    standardize
## SILICA                <NA>     <NA>    standardize
## SODIUM            gaussian identity    standardize
## SULFATE               <NA>     <NA>    standardize
## TURB              gaussian identity    standardize
## CHLX              gaussian identity    standardize
## CHLL              gaussian identity    standardize
## OXYGEN            gaussian identity    standardize
## TEMPERATURE       gaussian identity    standardize
## DEPTH             gaussian identity    standardize
## CONDUCTIVITY      gaussian identity    standardize
## SECCHI            gaussian identity    standardize
## AREA_HA               <NA>     <NA>    standardize
## ELEVATION             <NA>     <NA>    standardize
## ABUNDANCE             <NA>     <NA>    standardize
## PERIM_KM              <NA>     <NA>    standardize
## INDEX_LAT_DD          <NA>     <NA>    standardize
## INDEX_LON_DD          <NA>     <NA>    standardize
## MMI_BENT_NLA12    gaussian identity    standardize
## MMI_ZOOP_NLA6     gaussian identity    standardize
## AMITOTAL          gaussian identity    standardize
## LitCvrQc3OE       gaussian identity    standardize
## LitRipCvrQc3OE    gaussian identity    standardize
## RDis_IX           binomial    logit        squeeze
## RDis_IX:is_zero   binomial    logit           <NA>
## RVegQc3OE         gaussian identity    standardize
## MICX              gaussian identity    standardize
## ATRAZINE          gaussian identity    standardize
## METHYLMERCURY     gaussian identity    standardize
## TOTALHG           gaussian identity    standardize
## URBAN                 <NA>     <NA>           <NA>
## LAKE_ORIGIN       binomial    logit           <NA>
image(mdf2)

library(parallel)
detectCores()
## [1] 8
options(mc.cores = 2)

mdf2<-change(mdf2, y=c("ALUMINUM","AMMONIA_N","COLOR","NITRATE_N","NITRITE_N","POTASSIUM","SODIUM","TURB","CHLX","CHLL", "OXYGEN", "TEMPERATURE", "DEPTH","CONDUCTIVITY","SECCHI","MMI_BENT_NLA12", "MMI_ZOOP_NLA6","AMITOTAL", "LitCvrQc3OE", "LitRipCvrQc3OE", "RDis_IX","RVegQc3OE","MICX", "ATRAZINE", "METHYLMERCURY",  "TOTALHG", "LAKE_ORIGIN" ), what = "imputation_method", to="pmm")

#imputations <- mi(mdf2, n.iter = 10, n.chains = 3, max.minutes = 10)
#show(imputations)

## The above code gave an error
# Let's try using the mice package instead
library(mice)
imputed_Data <- mice(select_lakes2[,-1], m=3, maxit = 30, method = 'pmm', seed = 500, printFlag = F)

summary(imputed_Data)
## Class: mids
## Number of multiple imputations:  3 
## Imputation methods:
##       ALUMINUM      AMMONIA_N            ANC        CALCIUM       CHLORIDE 
##          "pmm"          "pmm"             ""             ""             "" 
##          COLOR           COND            DOC      MAGNESIUM      NITRATE_N 
##          "pmm"             ""             ""             ""          "pmm" 
##      NITRITE_N            NTL             PH      POTASSIUM            PTL 
##          "pmm"             ""             ""          "pmm"             "" 
##         SILICA         SODIUM        SULFATE           TURB           CHLX 
##             ""          "pmm"             ""          "pmm"          "pmm" 
##           CHLL         OXYGEN    TEMPERATURE          DEPTH   CONDUCTIVITY 
##          "pmm"          "pmm"          "pmm"          "pmm"          "pmm" 
##         SECCHI        AREA_HA      ELEVATION      ABUNDANCE       PERIM_KM 
##          "pmm"             ""             ""             ""             "" 
##   INDEX_LAT_DD   INDEX_LON_DD MMI_BENT_NLA12  MMI_ZOOP_NLA6       AMITOTAL 
##             ""             ""          "pmm"          "pmm"          "pmm" 
##    LitCvrQc3OE LitRipCvrQc3OE        RDis_IX      RVegQc3OE           MICX 
##          "pmm"          "pmm"          "pmm"          "pmm"          "pmm" 
##       ATRAZINE  METHYLMERCURY        TOTALHG          URBAN    LAKE_ORIGIN 
##          "pmm"          "pmm"          "pmm"             ""          "pmm" 
## PredictorMatrix:
##           ALUMINUM AMMONIA_N ANC CALCIUM CHLORIDE COLOR COND DOC MAGNESIUM
## ALUMINUM         0         1   1       1        1     1    1   1         1
## AMMONIA_N        1         0   1       1        1     1    1   1         1
## ANC              1         1   0       1        1     1    1   1         1
## CALCIUM          1         1   1       0        1     1    1   1         1
## CHLORIDE         1         1   1       1        0     1    1   1         1
## COLOR            1         1   1       1        1     0    1   1         1
##           NITRATE_N NITRITE_N NTL PH POTASSIUM PTL SILICA SODIUM SULFATE
## ALUMINUM          1         1   1  1         1   1      1      1       1
## AMMONIA_N         1         1   1  1         1   1      1      1       1
## ANC               1         1   1  1         1   1      1      1       1
## CALCIUM           1         1   1  1         1   1      1      1       1
## CHLORIDE          1         1   1  1         1   1      1      1       1
## COLOR             1         1   1  1         1   1      1      1       1
##           TURB CHLX CHLL OXYGEN TEMPERATURE DEPTH CONDUCTIVITY SECCHI
## ALUMINUM     1    1    1      1           1     1            1      1
## AMMONIA_N    1    1    1      1           1     1            1      1
## ANC          1    1    1      1           1     1            1      1
## CALCIUM      1    1    1      1           1     1            1      1
## CHLORIDE     1    1    1      1           1     1            1      1
## COLOR        1    1    1      1           1     1            1      1
##           AREA_HA ELEVATION ABUNDANCE PERIM_KM INDEX_LAT_DD INDEX_LON_DD
## ALUMINUM        1         1         1        1            1            1
## AMMONIA_N       1         1         1        1            1            1
## ANC             1         1         0        1            1            1
## CALCIUM         1         1         1        1            1            1
## CHLORIDE        1         1         1        1            1            1
## COLOR           1         1         1        1            1            1
##           MMI_BENT_NLA12 MMI_ZOOP_NLA6 AMITOTAL LitCvrQc3OE LitRipCvrQc3OE
## ALUMINUM               1             1        1           1              1
## AMMONIA_N              1             1        1           1              1
## ANC                    1             1        1           1              1
## CALCIUM                1             1        1           1              1
## CHLORIDE               1             1        1           1              1
## COLOR                  1             1        1           1              1
##           RDis_IX RVegQc3OE MICX ATRAZINE METHYLMERCURY TOTALHG URBAN
## ALUMINUM        1         1    1        1             1       1     1
## AMMONIA_N       1         1    1        1             1       1     1
## ANC             1         1    1        1             1       1     1
## CALCIUM         1         1    1        1             1       1     1
## CHLORIDE        1         1    1        1             1       1     1
## COLOR           1         1    1        1             1       1     1
##           LAKE_ORIGIN
## ALUMINUM            1
## AMMONIA_N           1
## ANC                 1
## CALCIUM             1
## CHLORIDE            1
## COLOR               1
## Number of logged events:  2340 
##   it im       dep meth     out
## 1  1  1  ALUMINUM  pmm  SODIUM
## 2  1  1 AMMONIA_N  pmm  SODIUM
## 3  1  1     COLOR  pmm  SODIUM
## 4  1  1 NITRATE_N  pmm  SODIUM
## 5  1  1 NITRITE_N  pmm  SODIUM
## 6  1  1 POTASSIUM  pmm SULFATE
densityplot(imputed_Data)

#plot(imputed_Data)

imp1 <- mice::complete(imputed_Data,1)

imp_mis <- missing_data.frame(imp1)
image(imp_mis)

imp1_sum <- summaries(imp1[,1:43])
diff1 <- imp1_sum - lakes_summ

imp2 <- mice::complete(imputed_Data,2)
imp2_sum <- summaries(imp2[,1:43])
diff2 <- imp2_sum - lakes_summ

imp3 <- mice::complete(imputed_Data,3)
imp3_sum <- summaries(imp3[,1:43])
diff3 <- imp3_sum - lakes_summ

Comparing the three imputed results, they are quite similar to each other. For simplicity, let's continue with the average of the three completed data frames. Alternatively, we can pool the final result by combining all of the imputed data frames.
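The pooling alternative follows mice's standard workflow: fit the same model on each completed data set, then combine the estimates with Rubin's rules via `pool()`. Below is a minimal sketch, assuming the `imputed_Data` object created above; the regression of CHLX on a few nutrient features is only an illustrative formula, not part of the analysis.

```r
# Fit the same linear model on each of the three completed data sets
fit <- with(imputed_Data, lm(CHLX ~ PTL + NTL + TURB))

# Combine coefficient estimates and standard errors across imputations
# using Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

Unlike simple averaging of the completed data frames, this approach retains the between-imputation uncertainty in the final estimates.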

Start imputation for 2007 data

Next, I will repeat the above process for the 2007 data using the mice and mi packages. To simplify, I use only a subset of the 120 available features, some of which are duplicates.

# Select relevant features
vars.07 <- c("SITE_ID", "PH_FIELD", "PH_LAB", "COND", "ANC", "TURB", "TOC", "DOC", "NH4N_PPM", "NO3_NO2", "NTL_PPM", "PTL","CL_PPM", "NO3N_PPM", "SO4_PPM", "CA_PPM","MG_PPM" , "NA_PPM", "K_PPM",  "COLOR", "SIO2", "NH4ION", "CHLA",  "SECMEAN", "LAT_DD", "LON_DD", "DO2_2M", "AREA_HA",  "LAKEPERIM" ,"SLD", "DEPTH_X", "DEPTHMAX", "ELEV_PT", "RDis_IX", "LitCvrQ", "LitRipCvr_OE", "RVeg_OE", "NTL","URBAN", "LAKE_ORIGIN", "REF_CLUSTER")

select.07.data <- select(lakes.07.df, vars.07)
mdf2007 <- missing_data.frame(select.07.data[,-1]) 
## Warning in .local(.Object, ...): PH_FIELD : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): CHLA : some observations are NaN, changing
## to NA
## Warning in .local(.Object, ...): SECMEAN : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DO2_2M : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DEPTH_X : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): DEPTHMAX : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): RDis_IX : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): LitCvrQ : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): LitRipCvr_OE : some observations are NaN,
## changing to NA
## Warning in .local(.Object, ...): RVeg_OE : some observations are NaN,
## changing to NA
## NOTE: The following pairs of variables appear to have the same missingness pattern.
##  Please verify whether they are in fact logically distinct variables.
##      [,1]          [,2]         
## [1,] "URBAN"       "LAKE_ORIGIN"
## [2,] "URBAN"       "REF_CLUSTER"
## [3,] "LAKE_ORIGIN" "REF_CLUSTER"
show(mdf2007)
## Object of class missing_data.frame with 1157 observations on 40 variables
## 
## There are 23 missing data patterns
## 
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
## 
##                                  type missing method   model
## PH_FIELD                   continuous       9    ppd  linear
## PH_LAB                     continuous       0   <NA>    <NA>
## COND                       continuous       0   <NA>    <NA>
## ANC                        continuous       0   <NA>    <NA>
## TURB                       continuous       0   <NA>    <NA>
## TOC                        continuous       0   <NA>    <NA>
## DOC                        continuous       0   <NA>    <NA>
## NH4N_PPM                   continuous       0   <NA>    <NA>
## NO3_NO2                    continuous       0   <NA>    <NA>
## NTL_PPM                    continuous       0   <NA>    <NA>
## PTL                        continuous       0   <NA>    <NA>
## CL_PPM                     continuous       0   <NA>    <NA>
## NO3N_PPM                   continuous       0   <NA>    <NA>
## SO4_PPM                    continuous       0   <NA>    <NA>
## CA_PPM                     continuous       0   <NA>    <NA>
## MG_PPM                     continuous       0   <NA>    <NA>
## NA_PPM                     continuous       0   <NA>    <NA>
## K_PPM                      continuous       0   <NA>    <NA>
## COLOR                      continuous       0   <NA>    <NA>
## SIO2                       continuous       0   <NA>    <NA>
## NH4ION                     continuous       0   <NA>    <NA>
## CHLA                       continuous       5    ppd  linear
## SECMEAN                    continuous      65    ppd  linear
## LAT_DD                     continuous       0   <NA>    <NA>
## LON_DD                     continuous       0   <NA>    <NA>
## DO2_2M                     continuous      55    ppd  linear
## AREA_HA                    continuous       0   <NA>    <NA>
## LAKEPERIM                  continuous       0   <NA>    <NA>
## SLD                        continuous       0   <NA>    <NA>
## DEPTH_X                    continuous      17    ppd  linear
## DEPTHMAX                   continuous       1    ppd  linear
## ELEV_PT                    continuous       0   <NA>    <NA>
## RDis_IX                 SC_proportion      57    ppd betareg
## RDis_IX:is_zero                binary      57    ppd   logit
## LitCvrQ                    continuous      60    ppd  linear
## LitRipCvr_OE               continuous      63    ppd  linear
## RVeg_OE                    continuous      60    ppd  linear
## NTL                        continuous       0   <NA>    <NA>
## URBAN                          binary     690    ppd   logit
## LAKE_ORIGIN                    binary     690    ppd   logit
## REF_CLUSTER     unordered-categorical     690    ppd  mlogit
## 
##                      family     link transformation
## PH_FIELD           gaussian identity    standardize
## PH_LAB                 <NA>     <NA>    standardize
## COND                   <NA>     <NA>    standardize
## ANC                    <NA>     <NA>    standardize
## TURB                   <NA>     <NA>    standardize
## TOC                    <NA>     <NA>    standardize
## DOC                    <NA>     <NA>    standardize
## NH4N_PPM               <NA>     <NA>    standardize
## NO3_NO2                <NA>     <NA>    standardize
## NTL_PPM                <NA>     <NA>    standardize
## PTL                    <NA>     <NA>    standardize
## CL_PPM                 <NA>     <NA>    standardize
## NO3N_PPM               <NA>     <NA>    standardize
## SO4_PPM                <NA>     <NA>    standardize
## CA_PPM                 <NA>     <NA>    standardize
## MG_PPM                 <NA>     <NA>    standardize
## NA_PPM                 <NA>     <NA>    standardize
## K_PPM                  <NA>     <NA>    standardize
## COLOR                  <NA>     <NA>    standardize
## SIO2                   <NA>     <NA>    standardize
## NH4ION                 <NA>     <NA>    standardize
## CHLA               gaussian identity    standardize
## SECMEAN            gaussian identity    standardize
## LAT_DD                 <NA>     <NA>    standardize
## LON_DD                 <NA>     <NA>    standardize
## DO2_2M             gaussian identity    standardize
## AREA_HA                <NA>     <NA>    standardize
## LAKEPERIM              <NA>     <NA>    standardize
## SLD                    <NA>     <NA>    standardize
## DEPTH_X            gaussian identity    standardize
## DEPTHMAX           gaussian identity    standardize
## ELEV_PT                <NA>     <NA>    standardize
## RDis_IX            binomial    logit        squeeze
## RDis_IX:is_zero    binomial    logit           <NA>
## LitCvrQ            gaussian identity    standardize
## LitRipCvr_OE       gaussian identity    standardize
## RVeg_OE            gaussian identity    standardize
## NTL                    <NA>     <NA>    standardize
## URBAN              binomial    logit           <NA>
## LAKE_ORIGIN        binomial    logit           <NA>
## REF_CLUSTER     multinomial    logit           <NA>
image(mdf2007)

Unlike the 2012 data, the 2007 data have few missing values, except for the URBAN, LAKE_ORIGIN, and REF_CLUSTER features, which are not missing at random and will not be included in the imputation process below.

#library(mice)
imputed_Data.07 <- mice(select.07.data[,2:38], m=3, maxit = 30, method = 'pmm', seed = 500, printFlag = F)
## Warning: Number of logged events: 2
lakes07_summ <- summaries(select.07.data[,2:38])
summary(imputed_Data.07)
## Class: mids
## Number of multiple imputations:  3 
## Imputation methods:
##     PH_FIELD       PH_LAB         COND          ANC         TURB 
##        "pmm"           ""           ""           ""           "" 
##          TOC          DOC     NH4N_PPM      NO3_NO2      NTL_PPM 
##           ""           ""           ""           ""           "" 
##          PTL       CL_PPM     NO3N_PPM      SO4_PPM       CA_PPM 
##           ""           ""           ""           ""           "" 
##       MG_PPM       NA_PPM        K_PPM        COLOR         SIO2 
##           ""           ""           ""           ""           "" 
##       NH4ION         CHLA      SECMEAN       LAT_DD       LON_DD 
##           ""        "pmm"        "pmm"           ""           "" 
##       DO2_2M      AREA_HA    LAKEPERIM          SLD      DEPTH_X 
##        "pmm"           ""           ""           ""           "" 
##     DEPTHMAX      ELEV_PT      RDis_IX      LitCvrQ LitRipCvr_OE 
##        "pmm"           ""        "pmm"        "pmm"        "pmm" 
##      RVeg_OE          NTL 
##        "pmm"           "" 
## PredictorMatrix:
##          PH_FIELD PH_LAB COND ANC TURB TOC DOC NH4N_PPM NO3_NO2 NTL_PPM
## PH_FIELD        0      1    1   1    1   1   1        1       1       1
## PH_LAB          1      0    1   1    1   1   1        1       1       1
## COND            1      1    0   1    1   1   1        1       1       1
## ANC             1      1    1   0    1   1   1        1       1       1
## TURB            1      1    1   1    0   1   1        1       1       1
## TOC             1      1    1   1    1   0   1        1       1       1
##          PTL CL_PPM NO3N_PPM SO4_PPM CA_PPM MG_PPM NA_PPM K_PPM COLOR SIO2
## PH_FIELD   1      1        1       1      1      1      1     1     1    1
## PH_LAB     1      1        1       1      1      1      1     1     1    1
## COND       1      1        1       1      1      1      1     1     1    1
## ANC        1      1        1       1      1      1      1     1     1    1
## TURB       1      1        1       1      1      1      1     1     1    1
## TOC        1      1        1       1      1      1      1     1     1    1
##          NH4ION CHLA SECMEAN LAT_DD LON_DD DO2_2M AREA_HA LAKEPERIM SLD
## PH_FIELD      1    1       1      1      1      1       1         1   1
## PH_LAB        1    1       1      1      1      1       1         1   1
## COND          1    1       1      1      1      1       1         1   1
## ANC           1    1       1      1      1      1       1         1   1
## TURB          1    1       1      1      1      1       1         1   1
## TOC           1    1       1      1      1      1       1         1   1
##          DEPTH_X DEPTHMAX ELEV_PT RDis_IX LitCvrQ LitRipCvr_OE RVeg_OE NTL
## PH_FIELD       0        1       1       1       1            1       1   0
## PH_LAB         0        1       1       1       1            1       1   0
## COND           0        1       1       1       1            1       1   0
## ANC            0        1       1       1       1            1       1   0
## TURB           0        1       1       1       1            1       1   0
## TOC            0        1       1       1       1            1       1   0
## Number of logged events:  2 
##   it im dep      meth     out
## 1  0  0     collinear     NTL
## 2  0  0     collinear DEPTH_X
#densityplot(imputed_Data.07)

Let us extract the imputed data frames, plot them, and compare their summary statistics against the original data frame.

#plot(imputed_Data)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
## 
##     complete
## The following object is masked from 'package:mi':
## 
##     complete
## The following object is masked from 'package:Matrix':
## 
##     expand
## The following object is masked from 'package:reshape2':
## 
##     smiths
library(ggplot2)
imp.07.1 <- mice::complete(imputed_Data.07,1)
imp_mis <- missing_data.frame(imp.07.1)
image(imp_mis)

imp.07.1_sum <- summaries(imp.07.1)
# diff1.7 <- imp.07.1_sum - lakes07_summ

imp.07.2 <- mice::complete(imputed_Data.07,2)
imp.07.2_sum <- summaries(imp.07.2)
diff2.07 <- imp.07.2_sum - lakes07_summ

imp.07.3 <- mice::complete(imputed_Data.07,3)
imp.07.3_sum <- summaries(imp.07.3)
diff3.07 <- imp.07.3_sum - lakes07_summ
par(mfrow=c(7,6), mai=c(.3,.3,.35,.1), omi=c(.2,.5,.5,.1) )
for (i in 1:37) {
  plot(density(select.07.data[,2:38][,i], na.rm = T), main="", cex=1.2, xlab="")
  mtext(names(imp.07.1)[i], bg="gray", line = 1)
  lines(density(imp.07.1[,i], na.rm = T), col="tomato", lwd=1)
  lines(density(imp.07.2[,i], na.rm = T), col="green", lwd=1)
  lines(density(imp.07.3[,i], na.rm = T), col="blue", lwd=1)
}
mtext("Density plots of the original and the imputed values for 2007 data", 
      side=3, outer=T, line=1, cex=1.5, col = "darkred")
mtext("Density", side=2, outer=TRUE, line=1, cex=1.2, col="darkred")

The above density plots indicate that the three imputed data sets closely match each other and the original data; thus, I will take the average of the three imputed results for downstream processing.

# Let's take the average of the three results
lakes07_complete <- (imp.07.1 + imp.07.2 + imp.07.3)/3.0

Exploratory Data Analysis

For the remaining sections I will focus mainly on the 2012 data; the 2007 data can either augment the 2012 data for prediction or serve to test the models.

Univariate visualization

library(reshape2)
# Let's take the average of the three results
lakes_complete <- (imp1 + imp2 + imp3)/3.0
lakes_complete$URBAN <- imp1$URBAN
lakes_complete$LAKE_ORIGIN <- imp1$LAKE_ORIGIN

final_lakes <- cbind(UID=select_lakes2$UID, lakes_complete)
#write.csv(final_lakes, "final.raw.12.csv")

final_summ <- summaries(lakes_complete[,1:43])

final_summ %>% kable()%>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
min max sd mean NAs
ALUMINUM 0.000 2.480 0.086 0.021 0
AMMONIA_N 0.000 3.180 0.133 0.037 0
ANC -3361.400 203857.000 10555.249 2781.494 0
CALCIUM 0.121 594.900 45.052 28.373 0
CHLORIDE 0.036 18012.742 550.409 50.257 0
COLOR 0.000 840.000 41.838 25.782 0
COND 2.820 64810.000 3233.833 694.122 0
DOC 0.230 515.810 19.272 8.476 0
MAGNESIUM 0.067 1023.000 66.561 24.568 0
NITRATE_N 0.000 51.660 1.575 0.105 0
NITRITE_N 0.000 0.419 0.013 0.001 0
NTL 0.014 54.000 2.579 1.142 0
PH 2.830 10.470 0.850 8.079 0
POTASSIUM 0.041 3376.000 127.025 12.351 0
PTL 4.000 3636.000 277.942 116.397 0
SILICA 0.022 935.000 30.861 10.969 0
SODIUM 0.093 29890.000 1134.019 112.550 0
SULFATE 0.026 47325.202 1741.119 200.493 0
TURB 0.010 403.600 24.646 9.200 0
CHLX 0.000 764.640 55.418 26.186 0
CHLL 0.000 584.640 58.786 26.981 0
OXYGEN 0.250 23.686 2.372 6.753 0
TEMPERATURE 6.659 109.100 6.673 21.563 0
DEPTH 0.000 25.000 3.524 3.190 0
CONDUCTIVITY 1.780 68863.636 3667.320 733.781 0
SECCHI 0.020 28.000 2.314 2.177 0
AREA_HA 1.033 167489.609 7501.797 861.475 0
ELEVATION -53.270 3594.970 758.063 646.201 0
ABUNDANCE 267.000 67860.000 5537.304 3683.193 0
PERIM_KM 0.394 3520.774 125.610 21.095 0
INDEX_LAT_DD 26.070 48.985 4.921 40.887 0
INDEX_LON_DD -124.230 -67.194 14.773 -95.437 0
MMI_BENT_NLA12 0.000 86.100 15.269 43.006 0
MMI_ZOOP_NLA6 0.050 94.650 17.405 55.777 0
AMITOTAL 0.000 1.795 0.337 0.352 0
LitCvrQc3OE 0.000 5.636 0.688 0.883 0
LitRipCvrQc3OE 0.000 4.696 0.453 0.736 0
RDis_IX 0.000 0.950 0.278 0.482 0
RVegQc3OE 0.000 4.532 0.497 0.704 0
MICX 0.000 66.690 3.543 0.675 0
ATRAZINE 0.000 9.700 0.484 0.125 0
METHYLMERCURY 0.080 19.000 1.541 1.249 0
TOTALHG 4.840 859.730 93.917 106.116 0
lakes_long <- melt(final_lakes[,1:44], id.var = "UID")
head(lakes_long)
##       UID variable value
## 1 1000001 ALUMINUM 0.007
## 2 1000010 ALUMINUM 0.020
## 3 1000011 ALUMINUM 0.012
## 4 1000013 ALUMINUM 0.031
## 5 1000014 ALUMINUM 0.000
## 6 1000015 ALUMINUM 2.480
ggplot(lakes_long, aes(x=value))+
  geom_histogram(aes(y=..density.., fill="chartreuse4", color="chartreuse4"), bins = 20)+
  geom_density(fill="deepskyblue", alpha=0.2)+
  facet_wrap(~variable, ncol = 5, scales = "free")+
  theme_bw()+guides(fill=F, color=F)

From the above plot we can see that most of the histograms are highly right-skewed, with most observations taking very low values and only a few taking high values. This is typical of environmental systems: most lakes have low baseline concentrations, while a few lakes have high values that could indicate human influence or characteristics specific to those lakes. To bring the data closer to a normal distribution, we can log-transform the skewed features.

In addition, some of the features have values of zero, which could be true zeros or artifacts of measurement detection limits. The EPA reports a detection limit for each feature, and these limits vary across features. To transform the data we need the features to be strictly positive, so we will add the smallest detection limit across all features to values below it. As noted in the summary section, the ELEVATION and ANC features have some negative values; to handle these I will use a cube root transformation on them instead of a log transformation.
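As a quick standalone illustration of why a signed cube root works where the log fails for negative values (a numeric check only, not part of the analysis pipeline):

```r
# log() is undefined for negatives and -Inf at zero, while the signed
# cube root maps negatives to negatives and preserves ordering
cube.r <- function(x) sign(x) * abs(x)^(1/3)

x <- c(-3361.4, 0, 27, 203857)    # span similar to the ANC feature
suppressWarnings(log(x))          # NaN  -Inf  3.30  12.23
cube.r(x)                         # approx. -14.98  0  3  58.85
```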

library(dplyr)

# Add a small perturbation to the data to avoid zeros
small_c <- 0.002
final_lakes.c <- final_lakes[,2:44] + (final_lakes[,2:44] < 0.002)*small_c

# Check if this has a significant effect on the data
new_summ <- summaries(final_lakes.c)
sum_new <-  new_summ - summaries(final_lakes[,2:44])
head(sum_new,10)
##             min max sd  mean NAs
## ALUMINUM  0.002   0  0 0.001   0
## AMMONIA_N 0.002   0  0 0.000   0
## ANC       0.002   0  0 0.000   0
## CALCIUM   0.000   0  0 0.000   0
## CHLORIDE  0.000   0  0 0.000   0
## COLOR     0.002   0  0 0.000   0
## COND      0.000   0  0 0.000   0
## DOC       0.000   0  0 0.000   0
## MAGNESIUM 0.000   0  0 0.000   0
## NITRATE_N 0.002   0  0 0.001   0

After inspecting the difference between the perturbed and the original data, the change is slight and within an acceptable range. Now we can start transforming the data, using either the log or the cube root, whichever works for the specific feature. Two features, ELEVATION and ANC (acid neutralizing capacity), have some negative values, so we cannot apply a log transformation directly: we either have to shift the data to be strictly positive or try a cube root transformation. First, I will try the cube root transformation on these features; if the result is not satisfactory, we can shift and log-transform instead.

# Copy the data frame
transform_lake.df <- final_lakes.c
# Cube root transform first
cube.r <- function(x){sign(x)*abs(x)^(1/3)}
transform_lake.df$ANC <- cube.r(transform_lake.df$ANC)
transform_lake.df$ELEVATION <- cube.r(transform_lake.df$ELEVATION)

Check their histograms before and after the transformation.

par(mfrow=c(2,2))
hist(final_lakes.c$ELEVATION, breaks=50, main="Original", col="#18BF5C", freq=F, xlab="Elevation")
lines(density(final_lakes.c$ELEVATION), col="darkred", lwd=3); box()
hist(transform_lake.df$ELEVATION, main="Transformed", breaks=50, col="#18BF5C", freq=F, xlab="Elevation")
lines(density(transform_lake.df$ELEVATION), col="darkred", lwd=3); box()

hist(final_lakes.c$ANC, breaks=50, main="Original", col="#18BF5C", freq=F, xlab="ANC")
lines(density(final_lakes.c$ANC), col="darkred", lwd=3); box()
hist(transform_lake.df$ANC, main="Transformed", breaks=50, col="#18BF5C", freq=F, xlab="ANC")
lines(density(transform_lake.df$ANC), col="darkred", lwd=3); box()

As can be seen in the above histograms, the cube root transformed elevation to an approximately Gaussian distribution quite effectively, while ANC might still need an additional transformation; for now I will leave it as it is.
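If we later decide ANC needs further work, a quick base-R normality check could guide that choice (a sketch only; note that shapiro.test handles at most 5000 observations, which our sample satisfies):

```r
# Visual and formal normality checks on the cube-root-transformed ANC
qqnorm(transform_lake.df$ANC, main = "Q-Q plot of transformed ANC")
qqline(transform_lake.df$ANC, col = "tomato", lwd = 2)
shapiro.test(transform_lake.df$ANC)   # a small p-value suggests non-normality
```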

# Variables that don't need transformation
ok_vars <- c("ANC","PH", "OXYGEN", "ELEVATION","INDEX_LAT_DD", "INDEX_LON_DD", "MMI_BENT_NLA12", "MMI_ZOOP_NLA6", "RDis_IX", "URBAN", "LAKE_ORIGIN")

# Variables that need transformation
log_vars <- c("ALUMINUM","AMMONIA_N", "CALCIUM", "CHLORIDE", "COLOR", "COND", "DOC", "MAGNESIUM", "NITRATE_N","NITRITE_N","NTL", "POTASSIUM", "PTL","SILICA", "SODIUM", "SULFATE","TURB", "CHLX", "CHLL", "TEMPERATURE", "DEPTH", "CONDUCTIVITY", "SECCHI", "AREA_HA", "ABUNDANCE", "PERIM_KM", "AMITOTAL", "LitCvrQc3OE", "LitRipCvrQc3OE", "RVegQc3OE", "MICX","ATRAZINE","METHYLMERCURY", "TOTALHG")

#for (i in 1:length(log_vars)) {
#  transform_lake.df[,log_vars[i]] <- log(transform_lake.df[,log_vars[i]])
#}

transform_lake.df[, log_vars] <- log(dplyr::select(transform_lake.df, log_vars))

transform_lake.df$UID <- final_lakes$UID
#trans_summ <- summaries(transform_lake.df)
ts_lake_long <- melt(transform_lake.df, id.var = "UID")

ggplot(ts_lake_long, aes(x=value))+
  geom_histogram(aes(y=..density.., fill="#81D2FF", color="#307D00"), bins = 10)+
  geom_density(fill="#FFB05A", alpha=0.2)+
  facet_wrap(~variable, ncol = 5, scales = "free")+
  guides(fill=F, color=F) + theme_bw()

As can be seen, most of the features now have a unimodal, approximately normal distribution, with a few exceptions. For now, I will leave them as they are and proceed to the next section.

Bivariate plot

library(corrplot)
## corrplot 0.84 loaded
# Compute correlation table
lakes_cor <- cor(transform_lake.df[,-44])
#dev.new(width=10, height=10, unit='in')
clrs <- colorRampPalette(c("#B26205","#FFB05A", "white","#81D2FF", "#0980B2"))
corrplot(lakes_cor, type = "lower", tl.pos = "ld", mar = c(1,1,1,1), 
         tl.col="black", addCoef.col = "black", tl.cex = 0.9, 
         diag = F, method = "square",col=clrs(100), number.cex = 0.55)

Next, let us filter highly correlated features and visualize them.

# Filter those with high correlation
threshold <- 0.85
high_cor <- lakes_cor
diag(high_cor) <- 0
ok <- apply(abs(high_cor) >= threshold, 1, any)
high_cor <-lakes_cor[ok,ok]

corrplot(high_cor, type = "lower",tl.cex = 1.2, tl.pos = "ld", mar = c(1,1,1,1), 
         addCoef.col = "black", method = "square", number.cex = 1, 
         tl.col="black", col=clrs(100), diag = F)

Most of the features with high correlations are expected from what we know about water quality parameters. For example, CHLL is the littoral chlorophyll-a concentration and CHLX is the chlorophyll-a concentration in the lake, and these two are expected to be highly correlated. In addition, COND (conductivity of the lake's water) is driven by the ions present in the water, such as calcium, chloride, sodium, etc., so we expect high correlation between COND and those ions. A more detailed plot is shown below.

library(psych)
high_cor_names <- colnames(high_cor) 
high_cor_lakes <- subset(transform_lake.df,select = high_cor_names)

pairs.panels(high_cor_lakes, 
             main="Pairs plot for highly correlated features",
             method = "pearson", # correlation method
             hist.col = "#0980B2",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Map plot using plotly

We can also see the exact locations of the sampled lakes, as shown below. We can zoom in and hover over the points to see the total nitrogen, total phosphorus, and area of each lake.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#final_lakes.ts <- transform_lake.df
# geo styling
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showland = TRUE,
  showlakes = TRUE,
  landcolor = toRGB("gray95"),
  subunitcolor = toRGB("gray85"),
  countrycolor = toRGB("gray85"),
  countrywidth = 0.5,
  subunitwidth = 0.5
)


p <- plot_geo(final_lakes, lat = ~INDEX_LAT_DD, lon = ~INDEX_LON_DD, color= ~CHLX) %>%
  add_markers(
    text = ~paste(paste("Total nitrogen:",round(NTL,2)), paste("Total phosphorus:",round(PTL,2)), paste("Area (HA):", round(AREA_HA,2)),sep = "<br />"),
    symbol = I("square"), size = ~AREA_HA, hoverinfo = "text"
  ) %>%
  colorbar(title = "Chlorophyll a amount<br />in the lake") %>%
  layout(
    title = 'US lakes<br />Surveyed in 2012', geo = g
  )
## Warning: `line.width` does not currently support multiple values.
p

Interactive visualization

I have developed an additional interactive visualization for the data using an R Shiny app, hosted online at [https://zerihun.shinyapps.io/US_lakes_classification/]. This app is useful for further visualization and for exploring feature relationships.

Principal components analysis

In this section, I will reduce the dimensionality of the data using principal component analysis (PCA). If we can significantly reduce the dimensionality, the new principal components can be used for visualization and for training downstream machine learning algorithms.

For the principal component analysis I will use the transformed data (the original data could also be used, although the transformations change the correlation structure and hence the components). Since the features have different measurement units and ranges, I will center and scale the data before performing the PCA.

#lakes12 <- read.csv("lakes_data2012.csv", header = T)
lakes12 <- transform_lake.df
# The target variable is the lakes Chlorophyl concentration
chl <- lakes12$CHLL+lakes12$CHLX
lakes12$Chlorophyl <- chl
lakes12$URBAN <- final_lakes$URBAN
lakes12$LAKE_ORIGIN <- final_lakes$LAKE_ORIGIN

lakes12 <- select(lakes12, UID, everything(), -CHLL, -CHLX)

# Save the data frame to file for future use.
write.csv(lakes12, "lakes12.ts.csv")

# Let us use the numerical variables for principal component analysis
x <- lakes12[,2:42]  # Deselect the response feature, URBAN, LAKE_ORIGIN and UID

pca1 <- prcomp(data.matrix(x), scale. = T, center = T)
summary(pca1)
## Importance of components:
##                          PC1     PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     3.513 1.93628 1.59032 1.57528 1.3941 1.3293 1.23954
## Proportion of Variance 0.301 0.09144 0.06169 0.06052 0.0474 0.0431 0.03747
## Cumulative Proportion  0.301 0.39243 0.45412 0.51464 0.5620 0.6051 0.64261
##                            PC8     PC9    PC10   PC11   PC12    PC13
## Standard deviation     1.11843 1.03857 0.95748 0.9279 0.9078 0.87297
## Proportion of Variance 0.03051 0.02631 0.02236 0.0210 0.0201 0.01859
## Cumulative Proportion  0.67312 0.69943 0.72179 0.7428 0.7629 0.78148
##                           PC14    PC15    PC16    PC17    PC18   PC19
## Standard deviation     0.85203 0.84381 0.80187 0.77956 0.76675 0.7495
## Proportion of Variance 0.01771 0.01737 0.01568 0.01482 0.01434 0.0137
## Cumulative Proportion  0.79918 0.81655 0.83223 0.84705 0.86139 0.8751
##                           PC20    PC21    PC22    PC23    PC24    PC25
## Standard deviation     0.72856 0.69242 0.64569 0.63963 0.61449 0.59701
## Proportion of Variance 0.01295 0.01169 0.01017 0.00998 0.00921 0.00869
## Cumulative Proportion  0.88804 0.89973 0.90990 0.91988 0.92909 0.93778
##                           PC26    PC27    PC28    PC29    PC30    PC31
## Standard deviation     0.55883 0.54193 0.53672 0.49957 0.49157 0.46148
## Proportion of Variance 0.00762 0.00716 0.00703 0.00609 0.00589 0.00519
## Cumulative Proportion  0.94540 0.95256 0.95959 0.96568 0.97157 0.97677
##                           PC32    PC33    PC34    PC35    PC36    PC37
## Standard deviation     0.44109 0.40130 0.38495 0.34599 0.31190 0.27862
## Proportion of Variance 0.00475 0.00393 0.00361 0.00292 0.00237 0.00189
## Cumulative Proportion  0.98151 0.98544 0.98905 0.99197 0.99435 0.99624
##                           PC38    PC39    PC40    PC41
## Standard deviation     0.25157 0.21348 0.17156 0.12612
## Proportion of Variance 0.00154 0.00111 0.00072 0.00039
## Cumulative Proportion  0.99778 0.99889 0.99961 1.00000
#plot(pca1, type="line", main = "Scree plot")

library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_eig(pca1, ncp = 20)

# Plot cumulative proportion of variance explained by the PCs
pc_var <- pca1$sdev^2
pct_var <- 100*pc_var/sum(pc_var)
plot(c(0, cumsum(pct_var)), xlab = "Principal Components", 
     main = "Cumulative proportion of variance explained",
     ylab = "% cumulative variance",
     type = "o", col="dark red",
     ylim = c(0,100))

We see that there is no clear “elbow” in the scree plot where the PCs’ contributions level off. To capture 100% of the variance we would need all 41 PCs, and even 95% requires 27 of the 41 PCs.

Since most of the variables have low pairwise correlation, it is expected that many PCs are needed to explain most of the variance in the features: about 15 PCs are required to exceed 80% of the total variance, which is still a high dimension by itself.
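The PC counts quoted above can be read off the cumulative variance curve; a small self-contained helper (for the lakes fit it would be applied to pca1$sdev):

```r
# How many principal components are needed to reach a given percentage
# of the total variance, given the component standard deviations.
n_pcs_needed <- function(sdev, threshold = 95) {
  pc_var <- sdev^2
  cum_pct <- cumsum(100 * pc_var / sum(pc_var))
  which(cum_pct >= threshold)[1]
}

# For the lakes PCA: n_pcs_needed(pca1$sdev, 95) and n_pcs_needed(pca1$sdev, 80)
```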

fviz_pca_var(pca1,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = FALSE, # set TRUE to repel overlapping labels
             select.var = list(cos2=10), # top10 according to cos val
             col.var = "darkred"
             )

From this plot we can see that the Secchi depth is negatively correlated with both principal components. This makes sense, because the Secchi depth measures the transparency of the lake: a higher Secchi depth indicates clearer water and hence a lower chlorophyll-a concentration.

Is there any difference between manmade and naturally occurring lakes?

The next plot examines whether there is a noticeable difference or clustering between man-made and natural lakes. As can be seen from the plot, there is no noticeable difference in water quality features between man-made and natural lakes.

fviz_pca_biplot(pca1,
                title= "PCA biplot",
                label ="var",
                habillage=lakes12$LAKE_ORIGIN,
                addEllipses=TRUE, 
                ellipse.level=0.95
                )

Linear regression

I will start the regression process by training a multiple linear model with all the predictor features included. This model can be enhanced by feature selection downstream, and it can also serve as a reference model for other, more complex models.

library(caret)
# Split the data to training and testing
lakes.df <- select(lakes12, -UID)
#lakes.df <- lakes.12[,2:45]
training <- caret::createDataPartition(lakes.df$URBAN, p=0.85, list = F)
lake_train<-lakes.df[training, ]
lake_test<-lakes.df[-training, ]

chl.lm <- lm(Chlorophyl ~ ., data=lake_train)
summary(chl.lm)
## 
## Call:
## lm(formula = Chlorophyl ~ ., data = lake_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1951 -0.7788  0.0229  0.7965  5.5421 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.9434254  1.9336197   1.005 0.315127    
## ALUMINUM           -0.1222922  0.0439032  -2.785 0.005455 ** 
## AMMONIA_N          -0.0117464  0.0609677  -0.193 0.847263    
## ANC                 0.0026027  0.0212423   0.123 0.902510    
## CALCIUM             0.1776504  0.0965881   1.839 0.066200 .  
## CHLORIDE            0.2479604  0.0605269   4.097 4.56e-05 ***
## COLOR               0.0165917  0.0551844   0.301 0.763742    
## COND               -0.6206814  0.2086478  -2.975 0.003009 ** 
## DOC                -0.2250675  0.1279957  -1.758 0.079013 .  
## MAGNESIUM           0.0321815  0.0822951   0.391 0.695851    
## NITRATE_N          -0.2202193  0.0389125  -5.659 2.03e-08 ***
## NITRITE_N           0.1512179  0.1766684   0.856 0.392253    
## NTL                 1.1256084  0.1334348   8.436  < 2e-16 ***
## PH                  0.1009699  0.0970588   1.040 0.298476    
## POTASSIUM           0.0138769  0.0670881   0.207 0.836176    
## PTL                 0.3461666  0.0775577   4.463 9.07e-06 ***
## SILICA             -0.0947985  0.0426004  -2.225 0.026304 *  
## SODIUM             -0.1089295  0.0967305  -1.126 0.260412    
## SULFATE            -0.0371353  0.0411725  -0.902 0.367322    
## TURB                0.4811859  0.0650241   7.400 3.07e-13 ***
## OXYGEN              0.0192284  0.0221010   0.870 0.384516    
## TEMPERATURE        -0.2116252  0.2506554  -0.844 0.398729    
## DEPTH               0.0320690  0.0538751   0.595 0.551824    
## CONDUCTIVITY        0.0215893  0.0576507   0.374 0.708131    
## SECCHI             -0.5203569  0.1021038  -5.096 4.20e-07 ***
## AREA_HA            -0.3473154  0.0941156  -3.690 0.000237 ***
## ELEVATION          -0.0743355  0.0227353  -3.270 0.001117 ** 
## ABUNDANCE           0.5650820  0.0637629   8.862  < 2e-16 ***
## PERIM_KM            0.4769289  0.1373432   3.473 0.000540 ***
## INDEX_LAT_DD       -0.0055581  0.0134181  -0.414 0.678808    
## INDEX_LON_DD        0.0003126  0.0045613   0.069 0.945379    
## MMI_BENT_NLA12     -0.0095538  0.0036447  -2.621 0.008904 ** 
## MMI_ZOOP_NLA6       0.0029671  0.0035713   0.831 0.406297    
## AMITOTAL           -0.0612869  0.0310069  -1.977 0.048391 *  
## LitCvrQc3OE         0.0057995  0.0887265   0.065 0.947899    
## LitRipCvrQc3OE      0.0027109  0.1653508   0.016 0.986923    
## RDis_IX            -0.1800969  0.1902459  -0.947 0.344065    
## RVegQc3OE           0.0470245  0.0829375   0.567 0.570862    
## MICX                0.0273046  0.0246226   1.109 0.267752    
## ATRAZINE            0.0071383  0.0243428   0.293 0.769405    
## METHYLMERCURY       0.0142991  0.0607254   0.235 0.813895    
## TOTALHG             0.0916517  0.0889921   1.030 0.303335    
## URBANYes            0.0671688  0.1144934   0.587 0.557576    
## LAKE_ORIGINNATURAL -0.1220843  0.1222409  -0.999 0.318194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.363 on 918 degrees of freedom
## Multiple R-squared:  0.7889, Adjusted R-squared:  0.779 
## F-statistic: 79.78 on 43 and 918 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(chl.lm, which = 1:2)

Observation on the training model

As can be seen from the training summary, 15 features have a significant (p < 0.05) linear relationship with the output variable chlorophyll, but most of the features (27 of them) do not appear to have a linear relationship with chlorophyll concentration.

Both \(R^2\) and adjusted \(R^2\) are close to 0.8, which is a reasonable value given the complexity of the model.

The residual plot looks reasonable across the middle range of chlorophyll values; however, the model deviates for low and high values of chlorophyll. In addition, the Q-Q plot indicates outliers at both ends, but over most of the data, in the middle range, the residuals look normally distributed.

The model also has an AIC value of 3371.01 and a BIC of 3590.12. We can use these values for future model selection comparisons.
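Both criteria come straight from the fitted lm object; a sketch on the built-in mtcars data, since chl.lm depends on the lake data (for the model above the calls would simply be AIC(chl.lm) and BIC(chl.lm)):

```r
# AIC and BIC for any lm fit; BIC applies a log(n) penalty per parameter,
# so it exceeds AIC whenever n > exp(2) ~ 7 observations.
m <- lm(mpg ~ wt, data = mtcars)
aic <- AIC(m)
bic <- BIC(m)
c(AIC = aic, BIC = bic)
```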

Evaluation of multiple linear regression on the test data

To assess how well the linear model trained above performs, we will evaluate it on the test data. Later, we will use this to compare it with other models.

performance <- function(y.obs, y.pred){
   model.rmse <- RMSE(y.obs, y.pred)
   model.cor <- cor(y.obs, y.pred)
   model.mae <- mean(abs(y.obs - y.pred))
   model.r2 <- 1- sum((y.obs - y.pred)^2)/sum((y.obs - mean(y.obs))^2)
   return(rbind(model.rmse, model.cor, model.mae, model.r2))
 }

y.pred <- predict(chl.lm, lake_test )

# Models performance metrics
ml.perf <- performance(lake_test$Chlorophyl, y.pred)
ml.perf
##                 [,1]
## model.rmse 1.3691218
## model.cor  0.8915594
## model.mae  0.9881585
## model.r2   0.7944393

As expected, the model performance decreased slightly on the test data; however, we still get fair performance, with good correlation between the observed and predicted values. The model gives a mean absolute error of 0.9881585 on the log scale, which corresponds to a multiplicative error factor of \(e^{\mathrm{MAE}} =\) 2.686283 in chlorophyll concentration (ug/L scale). Considering the range of chlorophyll concentration, from 0.4406993 to 1166, this error can be considered acceptable.
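Because chlorophyll entered the model on the log scale, errors back-transform multiplicatively; the conversion quoted above is simply:

```r
# MAE on the log scale converts to a multiplicative error factor via exp():
# predictions are typically off by this factor on the original ug/L scale.
mae_log <- 0.9881585
error_factor <- exp(mae_log)
error_factor
```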

Stepwise linear model selection

From the multiple linear regression above, we observed that only 15 features had a significant linear relationship with the response variable. Hence, we can narrow down the number of features and select only the important ones using the step function. I will run step with both backward and forward selection.

# Define a base model - intercept only
base.mod <- lm(Chlorophyl ~ 1, data=lake_train)
# Define the full model - including all predictors
all.mod <- lm(Chlorophyl ~ ., data=lake_train)

lakes_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = F)
summary(lakes_step)
## 
## Call:
## lm(formula = Chlorophyl ~ SECCHI + NTL + ABUNDANCE + TURB + ELEVATION + 
##     NITRATE_N + DOC + PTL + SILICA + MMI_BENT_NLA12 + COND + 
##     CHLORIDE + CALCIUM + ALUMINUM + AMITOTAL + RVegQc3OE + LAKE_ORIGIN + 
##     SODIUM, data = lake_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4683 -0.7674  0.0238  0.8267  5.7185 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.391495   0.752253   1.850 0.064659 .  
## SECCHI             -0.501315   0.091961  -5.451 6.38e-08 ***
## NTL                 1.256190   0.116888  10.747  < 2e-16 ***
## ABUNDANCE           0.568734   0.061356   9.269  < 2e-16 ***
## TURB                0.466093   0.062380   7.472 1.80e-13 ***
## ELEVATION          -0.083296   0.018339  -4.542 6.29e-06 ***
## NITRATE_N          -0.226141   0.032643  -6.928 7.92e-12 ***
## DOC                -0.324669   0.109876  -2.955 0.003206 ** 
## PTL                 0.338990   0.072154   4.698 3.02e-06 ***
## SILICA             -0.094126   0.040224  -2.340 0.019489 *  
## MMI_BENT_NLA12     -0.009974   0.003515  -2.838 0.004637 ** 
## COND               -0.594700   0.135480  -4.390 1.26e-05 ***
## CHLORIDE            0.241610   0.053448   4.520 6.96e-06 ***
## CALCIUM             0.163125   0.080454   2.028 0.042887 *  
## ALUMINUM           -0.130094   0.039055  -3.331 0.000899 ***
## AMITOTAL           -0.057223   0.028024  -2.042 0.041438 *  
## RVegQc3OE           0.081985   0.038706   2.118 0.034424 *  
## LAKE_ORIGINNATURAL -0.213220   0.099790  -2.137 0.032880 *  
## SODIUM             -0.136208   0.083607  -1.629 0.103617    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.364 on 943 degrees of freedom
## Multiple R-squared:  0.783,  Adjusted R-squared:  0.7789 
## F-statistic: 189.1 on 18 and 943 DF,  p-value: < 2.2e-16
lakes_step
## 
## Call:
## lm(formula = Chlorophyl ~ SECCHI + NTL + ABUNDANCE + TURB + ELEVATION + 
##     NITRATE_N + DOC + PTL + SILICA + MMI_BENT_NLA12 + COND + 
##     CHLORIDE + CALCIUM + ALUMINUM + AMITOTAL + RVegQc3OE + LAKE_ORIGIN + 
##     SODIUM, data = lake_train)
## 
## Coefficients:
##        (Intercept)              SECCHI                 NTL  
##           1.391495           -0.501315            1.256190  
##          ABUNDANCE                TURB           ELEVATION  
##           0.568734            0.466093           -0.083296  
##          NITRATE_N                 DOC                 PTL  
##          -0.226141           -0.324669            0.338990  
##             SILICA      MMI_BENT_NLA12                COND  
##          -0.094126           -0.009974           -0.594700  
##           CHLORIDE             CALCIUM            ALUMINUM  
##           0.241610            0.163125           -0.130094  
##           AMITOTAL           RVegQc3OE  LAKE_ORIGINNATURAL  
##          -0.057223            0.081985           -0.213220  
##             SODIUM  
##          -0.136208

We can see that stepwise linear model selection has retained 18 features/variables as relevant to the linear model, which is still a high dimension. Next, we can compute and visualize the importance ranking of these features.

#library(mlbench)

# estimate variable importance
predStepwise <- varImp(lakes_step, scale=FALSE)

# Shorten the dummy-variable name for plotting
rownames(predStepwise)[rownames(predStepwise) == "LAKE_ORIGINNATURAL"] <- "LAKE_ORIGIN"
par(mai=c(1.5,1,0.2,0.1))
bar.df <- predStepwise[order(predStepwise, decreasing = T),]
lm.bar <- barplot(as.matrix(bar.df), 
                  main="Important features selected by stepwise linear selection",
                  ylab = "Feature importance",
                  beside = T, ylim=c(0,12),
                  names.arg = rownames(predStepwise)[order(predStepwise, decreasing = T)], 
                  las=2,
                  col=clrs(21),
                  cex.names = 0.8)

Let us now evaluate the performance of the stepwise linear model on the test data.

lm.step.pred <- predict(lakes_step, lake_test)

# Models performance metrics
lm.step.perf <- performance(lake_test$Chlorophyl, lm.step.pred )
lm.step.perf
##                 [,1]
## model.rmse 1.3887621
## model.cor  0.8882759
## model.mae  0.9792680
## model.r2   0.7884994
perform.table <- data.frame(lm.all=ml.perf, lm.step=lm.step.perf)

Looking at the correlation, we get a reasonably good result, similar to the multiple linear regression that used all the features.

We can further check the model's performance using 10-fold cross-validation with the cv.lm function from the DAAG package, as shown below.

library(DAAG)
## 
## Attaching package: 'DAAG'
## The following object is masked from 'package:psych':
## 
##     cities
lakes.cv.lm <- cv.lm(data=lake_train, lakes_step, m=10, printit=F, seed=30)
## Warning in cv.lm(data = lake_train, lakes_step, m = 10, printit = F, seed = 30): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

lm.step.cv.perf <- performance(lakes.cv.lm$Chlorophyl, lakes.cv.lm$cvpred)
lm.step.cv.perf
##                 [,1]
## model.rmse 1.3947268
## model.cor  0.8766364
## model.mae  1.0333327
## model.r2   0.7684066
perform.table$lm.step.cv <- lm.step.cv.perf

With 10-fold CV we obtain a comparable result from the stepwise linear model, with \(R^2\) of 0.7684066 and a correlation between the observed and predicted values close to 0.9.

Therefore, we can conclude that multiple linear regression with the 18 selected features gives reasonable, acceptable results. However, the model performs well mainly when the chlorophyll concentration is in the middle range. Thus, other non-linear machine learning algorithms, such as artificial neural networks (ANN) or generalized additive models (GAM), might need to be explored to obtain improved results.
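As a hedged sketch of one such non-linear option, here is a generalized additive model via the mgcv package, demonstrated on synthetic data; for the lakes data the analogous call would place smooth terms s() on the stepwise-selected features, e.g. gam(Chlorophyl ~ s(SECCHI) + s(NTL) + ..., data = lake_train) — that formula is illustrative, not a fitted result.

```r
# GAM with smooth terms: each s() fits a penalized spline, capturing
# non-linear feature-response relationships that lm() cannot.
library(mgcv)

set.seed(42)
n <- 300
x1 <- runif(n)
x2 <- runif(n)
y <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)  # known non-linear signal

gam.fit <- gam(y ~ s(x1) + s(x2), method = "REML")
summary(gam.fit)$r.sq  # adjusted R-squared of the smooth fit
```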

Feature selection

I performed feature selection in the previous section using the stepwise approach. In this section, I will explore two other commonly used feature selection methods, namely Boruta and LASSO.

Feature selection using Boruta

library(Boruta)
## Loading required package: ranger
set.seed(123)
lakes.boru<-Boruta(Chlorophyl~., data=lake_train, doTrace=0)
print(lakes.boru)
## Boruta performed 99 iterations in 1.272344 mins.
##  33 attributes confirmed important: ABUNDANCE, AMITOTAL,
## AMMONIA_N, ANC, ATRAZINE and 28 more;
##  8 attributes confirmed unimportant: AREA_HA, LitCvrQc3OE,
## LitRipCvrQc3OE, NITRITE_N, PERIM_KM and 3 more;
##  2 tentative attributes left: ALUMINUM, LAKE_ORIGIN;
par(mai=c(1.8,0.9,0.5,0.2))
plot(lakes.boru, xlab="", xaxt="n")

lz<-lapply(1:ncol(lakes.boru$ImpHistory), function(i)
lakes.boru$ImpHistory[is.finite(lakes.boru$ImpHistory[, i]), i])

names(lz)<-colnames(lakes.boru$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(lakes.boru$ImpHistory), cex.axis=0.8, font = 1)
title("Feature importance as predicted by Boruta", col.main="darkgreen")

Boruta confirms 33 of the 43 features as important, which does not reduce the feature set as effectively as the stepwise selection did. This might be due to Boruta's inability to narrow the set down further, or it could indicate that the problem is genuinely high-dimensional and most of the features are important.

getConfirmedFormula(lakes.boru)
## Chlorophyl ~ AMMONIA_N + ANC + CALCIUM + CHLORIDE + COLOR + COND + 
##     DOC + MAGNESIUM + NITRATE_N + NTL + PH + POTASSIUM + PTL + 
##     SILICA + SODIUM + SULFATE + TURB + OXYGEN + TEMPERATURE + 
##     DEPTH + CONDUCTIVITY + SECCHI + ELEVATION + ABUNDANCE + INDEX_LAT_DD + 
##     INDEX_LON_DD + MMI_BENT_NLA12 + MMI_ZOOP_NLA6 + AMITOTAL + 
##     MICX + ATRAZINE + METHYLMERCURY + TOTALHG
## <environment: 0x0000000015b44e20>
predBoruta <- getSelectedAttributes(lakes.boru, withTentative = F)

Let us also compare the results of Boruta with the stepwise feature selection.

intersect(predBoruta, rownames(predStepwise))
##  [1] "CALCIUM"        "CHLORIDE"       "DOC"            "NITRATE_N"     
##  [5] "NTL"            "PTL"            "SILICA"         "SODIUM"        
##  [9] "TURB"           "SECCHI"         "ELEVATION"      "ABUNDANCE"     
## [13] "MMI_BENT_NLA12" "AMITOTAL"

Here we can see that there is agreement between stepwise linear selection and Boruta, with 14 features commonly selected as important.

Next, we will use LASSO regularization to do the feature selection and compare it with the previous two methods.

Feature selection using LASSO

#### 10-fold cross validation ####
# LASSO
library("glmnet")
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(doParallel)
## Loading required package: iterators
registerDoParallel(3)
set.seed(1234)  # set seed 
# (10-fold) cross validation for the LASSO

# Select the predictor features data only
xTrain <- data.matrix(select(lake_train,-Chlorophyl))

# Binarize the categorical features
urban <- ifelse(lake_train$URBAN == "Yes", 1,0)
lake_origin <- ifelse(lake_train$LAKE_ORIGIN == "Natural", 1,0)

# Replace the categorical data with binarized data
xTrain[,42:43] <- cbind(urban,lake_origin)
dim(xTrain)
## [1] 962  43
#Do the same on the test data
xTest <- data.matrix(select(lake_test,-Chlorophyl))
urban <- ifelse(lake_test$URBAN == "Yes", 1,0)
lake_origin <- ifelse(lake_test$LAKE_ORIGIN == "Natural", 1,0)
xTest[,42:43] <- cbind(urban,lake_origin)
dim(xTest)
## [1] 168  43
yTrain <- lake_train$Chlorophyl
yTest <- lake_test$Chlorophyl

cvLASSO <- cv.glmnet(data.matrix(xTrain), yTrain, alpha = 1, parallel=TRUE)
par(mai=c(1,1,1,0.1))
plot(cvLASSO)
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

# Report MSE LASSO
predLASSO <-  predict(cvLASSO, s = cvLASSO$lambda.1se, newx = xTest)
LASSO.perf <- performance(yTest, predLASSO)
LASSO.perf
##                    1
## model.rmse 1.4310867
## model.cor  0.8836414
## model.mae  1.0736569
## model.r2   0.7754114
perform.table$LASSO <- LASSO.perf

Again, we see that LASSO resulted in performance similar to the stepwise linear regression.

library(kableExtra)
# lambda that gives best performance on misclassification error
bestlambda <- cvLASSO$lambda.min

# extract parameters at each of these lambda values
coefs.bestLambda <- as.matrix(coef(cvLASSO, s = bestlambda))

# which coefficients are positive (note: this keeps only positive
# coefficients; use `!= 0` to also retain negative non-zero terms)
coef.nonzero <- (data.frame(variables=rownames(coefs.bestLambda)[which(coefs.bestLambda > 0)], coefs=coefs.bestLambda[ which( coefs.bestLambda > 0)]))

coef.nonzero %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
variables coefs
(Intercept) 1.0539817
CALCIUM 0.1132904
CHLORIDE 0.1966729
COLOR 0.0037958
NITRITE_N 0.1419699
NTL 1.0873324
PH 0.1144994
PTL 0.3415499
TURB 0.4916500
OXYGEN 0.0054959
DEPTH 0.0216674
ABUNDANCE 0.5628738
PERIM_KM 0.3600837
INDEX_LON_DD 0.0003866
MMI_ZOOP_NLA6 0.0022615
LitRipCvrQc3OE 0.0239774
RVegQc3OE 0.0364415
MICX 0.0249451
ATRAZINE 0.0080291
METHYLMERCURY 0.0120455
TOTALHG 0.0590776
URBAN 0.0846511
# Compare the features selected using LASSO and stepwise
lasso.vars <- as.character(coef.nonzero$variables[-1])
lasso.vars
##  [1] "CALCIUM"        "CHLORIDE"       "COLOR"          "NITRITE_N"     
##  [5] "NTL"            "PH"             "PTL"            "TURB"          
##  [9] "OXYGEN"         "DEPTH"          "ABUNDANCE"      "PERIM_KM"      
## [13] "INDEX_LON_DD"   "MMI_ZOOP_NLA6"  "LitRipCvrQc3OE" "RVegQc3OE"     
## [17] "MICX"           "ATRAZINE"       "METHYLMERCURY"  "TOTALHG"       
## [21] "URBAN"
intersect(rownames(predStepwise), lasso.vars)
## [1] "NTL"       "ABUNDANCE" "TURB"      "PTL"       "URBAN"     "CHLORIDE" 
## [7] "CALCIUM"   "RVegQc3OE"
# Compare the features selected using LASSO and boruta
intersect(predBoruta, lasso.vars)
##  [1] "CALCIUM"       "CHLORIDE"      "COLOR"         "NTL"          
##  [5] "PH"            "PTL"           "TURB"          "OXYGEN"       
##  [9] "DEPTH"         "ABUNDANCE"     "INDEX_LON_DD"  "MMI_ZOOP_NLA6"
## [13] "MICX"          "ATRAZINE"      "METHYLMERCURY" "TOTALHG"
# Compare the features selected by LASSO, boruta and stepwise
comm_selected_vars <- intersect(intersect(predBoruta, rownames(predStepwise)), lasso.vars)
comm_selected_vars
## [1] "CALCIUM"   "CHLORIDE"  "NTL"       "PTL"       "TURB"      "ABUNDANCE"
all_selected_vars <- union(union(predBoruta, rownames(predStepwise)), lasso.vars)
all_selected_vars
##  [1] "AMMONIA_N"          "ANC"                "CALCIUM"           
##  [4] "CHLORIDE"           "COLOR"              "COND"              
##  [7] "DOC"                "MAGNESIUM"          "NITRATE_N"         
## [10] "NTL"                "PH"                 "POTASSIUM"         
## [13] "PTL"                "SILICA"             "SODIUM"            
## [16] "SULFATE"            "TURB"               "OXYGEN"            
## [19] "TEMPERATURE"        "DEPTH"              "CONDUCTIVITY"      
## [22] "SECCHI"             "ELEVATION"          "ABUNDANCE"         
## [25] "INDEX_LAT_DD"       "INDEX_LON_DD"       "MMI_BENT_NLA12"    
## [28] "MMI_ZOOP_NLA6"      "AMITOTAL"           "MICX"              
## [31] "ATRAZINE"           "METHYLMERCURY"      "TOTALHG"           
## [34] "URBAN"              "LAKE_ORIGIN"        "RVegQc3OE"         
## [37] "LAKE_ORIGINNATURAL" "NITRITE_N"          "PERIM_KM"          
## [40] "LitRipCvrQc3OE"

Feature selection using LASSO results in 21 features being selected as important, slightly more than the 18 retained by the stepwise selection above. However, only 8 features were selected by both stepwise and LASSO regression, and 16 features were selected as important by both LASSO and Boruta. Finally, only 6 features were selected in common by all three methods. This indicates substantial disagreement among the feature selection methods.

Next, let us see the predictive power of a linear regression using only the 6 commonly selected features.

select.vars <- c(comm_selected_vars, "Chlorophyl")
small.lm <- lm(as.formula(paste("Chlorophyl ~ ", paste(comm_selected_vars, collapse = " + "))), 
               data = lake_train[, select.vars])

summary(small.lm)
## 
## Call:
## lm(formula = as.formula(paste("Chlorophyl ~ ", paste(comm_selected_vars, 
##     collapse = " + "))), data = lake_train[, select.vars])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5734 -0.8593  0.1054  0.9602  5.2000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.47623    0.60608  -2.436 0.015045 *  
## CALCIUM     -0.22033    0.04968  -4.435 1.03e-05 ***
## CHLORIDE     0.04758    0.03416   1.393 0.163962    
## NTL          0.90841    0.08583  10.584  < 2e-16 ***
## PTL          0.25650    0.06589   3.893 0.000106 ***
## TURB         0.83515    0.05074  16.458  < 2e-16 ***
## ABUNDANCE    0.62133    0.06741   9.217  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.563 on 955 degrees of freedom
## Multiple R-squared:  0.7112, Adjusted R-squared:  0.7094 
## F-statistic:   392 on 6 and 955 DF,  p-value: < 2.2e-16
lm.small.pred <- predict(small.lm, lake_test[, select.vars])

# Calculate the RMSE
lm.small.perf <- performance(lake_test$Chlorophyl, lm.small.pred)
lm.small.perf
##                 [,1]
## model.rmse 1.5559103
## model.cor  0.8583315
## model.mae  1.1863575
## model.r2   0.7345242
perform.table$lm.small <- lm.small.perf 

Comparing model performance so far

models.perf <- data.frame (t(perform.table))
models.perf$models <- rownames(models.perf)

perf.long <- melt(models.perf, id.vars = "models" )

theme_set(theme_bw())

ggplot(perf.long, aes(x=models, y=value)) + 
  geom_point(size=5) + 
  geom_segment(aes(x=models, xend=models, y=0, yend=value)) + 
  labs(title="Model Performance Metrics") + ylab("Metric value")+
  theme(axis.text.x = element_text(size=14, angle = 60, hjust = 1),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=18,face="bold", color = "darkred"))+
  facet_wrap(~ variable)

library(fmsb)
models.perf <- data.frame (t(perform.table))
names(models.perf) <- c("RMSE", "Corr Coef", "MAE","R2")
models.perf <- rbind(rep(2,4), rep(0,4), models.perf)
col_fill <- c(rgb(0.5,0.2,0.8,0.2), rgb(0.5,0.2,0.1,0.2), rgb(0.5,0.8,0.1,0.2),rgb(0.2,0.2,0.2,0.2))
radarchart(models.perf, pfcol=col_fill)

As we can see from the results above, all of the regressions gave very similar results on all of the performance metrics chosen on the test data. In the next sections, I will train an artificial neural network and see if we can improve on the prediction obtained using the stepwise linear regression.

ANN Predictive modeling

In this section I will train an artificial neural network, and for this I will use the keras package. For comparison, I will use the features selected by the stepwise linear model.

Since the features have different measurement units and wide ranges of values, I will standardize the numeric features first.

Prepare the data for ANN

library(reticulate)
library(keras)
## First using only the features selected by stepwise model
# Normalize training data
step_vars <- rownames(predStepwise)
ann.vars <-c("SECCHI", "NTL", "ABUNDANCE", "TURB", "ELEVATION", "PTL",  "NITRATE_N", "DOC", "SILICA", "ALUMINUM",      "URBAN", "AMITOTAL", "COND", "LAKE_ORIGIN", "CALCIUM", "PH", "RDis_IX", "MMI_ZOOP_NLA6")
#ann.train.df <- select(lake_train, ann.vars)

train_std <- scale(select(lake_train[, ann.vars], -URBAN, -LAKE_ORIGIN)) 
dim(train_std)
## [1] 962  16
col_means_train <- attr(train_std, "scaled:center") 
col_stddevs_train <- attr(train_std, "scaled:scale")

# Add back the factor variables, which we don't want to standardize
train_std <- data.frame(train_std,
                        URBAN=ifelse(lake_train$URBAN == "Yes", 1,0),
                        LAKE_ORIGIN =ifelse(lake_train$LAKE_ORIGIN == "Natural", 1,0)
                        )

# Use the means and standard deviations from the training set to normalize the test set
test_std <- scale(select(lake_test[, ann.vars], -URBAN, -LAKE_ORIGIN), 
                  center = col_means_train, scale = col_stddevs_train)
dim(train_std)
## [1] 962  18

# Add back the factor variables, which we don't want to standardize
test_std <- data.frame(test_std, 
                       URBAN=ifelse(lake_test$URBAN == "Yes", 1,0),
                       LAKE_ORIGIN =ifelse(lake_test$LAKE_ORIGIN == "Natural", 1,0)
                       )

yTrain <- lake_train$Chlorophyl
yTest <- lake_test$Chlorophyl

Create the model

Next, I will build the ANN model with two hidden layers and one output node for the regression. To start, each hidden layer will have 64 nodes with a ReLU activation function. For the output layer, since this is a regression, I will use a single node with a linear activation function.

trainX <- as.matrix(train_std)  # Features selected by stepwise only

build_model <- function() {
  
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu", input_shape = dim(trainX)[2]) %>%
    layer_dense(units = 64,kernel_initializer = 'normal', activation = "relu") %>%
    layer_dense(units = 1, kernel_initializer = 'normal', activation = "linear")
  
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(lr=0.0001),
    metrics = list("mean_absolute_error")
  )
  
  model
}

model <- build_model()
model %>% summary()
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 64)                    1216        
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 64)                    4160        
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 1)                     65          
## ===========================================================================
## Total params: 5,441
## Trainable params: 5,441
## Non-trainable params: 0
## ___________________________________________________________________________

Train the model

Next, the model will be trained for 500 epochs.

# Display training progress by printing a single dot for each completed epoch.
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  }
)

# The patience parameter is the number of epochs to wait for improvement.
# (early_stop is defined here but not passed to fit() below, so training runs the full 500 epochs.)
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 100)

epochs <- 500
set.seed(12)

# Fit the model and store training stats
history <- model %>% fit(
  trainX,
  yTrain,
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(print_dot_callback)
)
## 
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ................................................................................
## ....................
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(ylim = c(0,5))+ geom_point(aes(alpha=1))+ guides(alpha=F)

min(history$metrics$val_mean_absolute_error)
## [1] 1.041282

Check performance of the ANN model on test data

c(loss, mae) %<-% (model %>% evaluate(as.matrix(test_std), yTest, verbose = 0))

paste0("Mean absolute error on test set: ", sprintf("%.2f", mae))
## [1] "Mean absolute error on test set: 1.00"
ann_test_pred <- model %>% predict(as.matrix(test_std))

ann.perf <- performance(yTest, ann_test_pred)
ann.perf
##                 [,1]
## model.rmse 1.3320329
## model.cor  0.8977825
## model.mae  1.0034594
## model.r2   0.8054256
perform.table$ann.perf <- ann.perf

We again see that the performance of the ANN model on the test data is almost exactly the same as that of the multiple linear regression with all the features and the stepwise linear model. Hence, for the sake of simplicity and interpretability, in this case we would choose the linear model with stepwise selection. However, this doesn't mean the ANN model built here is the best we can get. There are a number of hyperparameters we can play with to see if we can improve the model's performance: we can change the number of hidden layers and nodes, the activation function and the learning rate, and the overall model structure.

Another machine learning technique we could try is tree-based regression, which I will train next.

Tree based regression

Next, I will try the prediction using a tree-based regression with the rpart package.

library(rpart)

set.seed(1234)
rpart_train <- data.frame(train_std)
rpart_train$Chlorophyl <- yTrain

lakes.rpart<-rpart(Chlorophyl~., data=rpart_train)
lakes.rpart
## n= 962 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 962 8080.29600  4.2904460  
##    2) TURB< 0.07309393 536 1889.46500  2.4876570  
##      4) SECCHI>=0.627837 275  661.46530  1.5363600  
##        8) NTL< -0.9851275 111  206.58060  0.6644478 *
##        9) NTL>=-0.9851275 164  313.38440  2.1264960 *
##      5) SECCHI< 0.627837 261  716.91850  3.4899820  
##       10) TURB< -0.8923215 20   76.32661  1.1272760 *
##       11) TURB>=-0.8923215 241  519.67880  3.6860580 *
##    3) TURB>=0.07309393 426 2256.96100  6.5587440  
##      6) NTL< 0.6623992 230  778.34720  5.3506650  
##       12) NTL< 0.03702325 100  292.80300  4.3990440 *
##       13) NTL>=0.03702325 130  325.32570  6.0826810  
##         26) ABUNDANCE< -0.8379097 22   34.05670  4.2553210 *
##         27) ABUNDANCE>=-0.8379097 108  202.84090  6.4549210 *
##      7) NTL>=0.6623992 196  749.03460  7.9763880  
##       14) TURB< 0.7057876 65  142.54000  6.5307150 *
##       15) TURB>=0.7057876 131  403.24100  8.6937070  
##         30) NTL< 1.659964 95  184.65750  8.1754330 *
##         31) NTL>=1.659964 36  125.72720 10.0613700 *
library(rpart.plot)
rpart.plot(lakes.rpart, digits=3, main="Tree based regression result")

rpart.plot(lakes.rpart, digits = 4, fallen.leaves = T, type=3, extra=101,
           main="Tree based regression result", cex=0.85)

The tree regression using rpart chose the TURB feature (the turbidity of the lake) as the root, followed by NTL (total nitrogen in the lake), SECCHI (a measure of the transparency of the lake) and ABUNDANCE (the abundance of phytoplankton in the lake). This result makes sense, since these parameters are directly related to algal growth and eutrophication of lakes. We can test the model's performance on the test data as follows.

lak.rp.pred<-predict(lakes.rpart, data.frame(test_std))
summary(lak.rp.pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6644  2.1265  3.6861  4.2457  6.4549 10.0614
rpart.perf <- performance(yTest, lak.rp.pred)
rpart.perf
##                 [,1]
## model.rmse 1.8360633
## model.cor  0.7953457
## model.mae  1.3843657
## model.r2   0.6303156

Again we see that rpart was not able to yield a better result compared to the previous models. Next, I will train the xgboost algorithm, which is an acronym for extreme gradient boosting, a decision-tree-based algorithm.

Regression using xgboost algorithm

Now, I will train an xgboost model, which is known for its high efficiency, flexibility and portability. Again, similar to the ANN, xgboost has a number of tuning parameters to improve the model's performance. For this work I will use commonly used values for regression. The implementation of the xgboost algorithm in R is found in the xgboost package.

library(xgboost)
dtrain <- xgb.DMatrix(data = xTrain, label = yTrain)
lake.xgb <- xgboost(data = dtrain, max_depth = 6, eta = 0.01, gamma = 0.01, nthread = 3, nfold=5,
                      nrounds = 1000,eval_metric = "rmse", objective = "reg:linear", verbose = 0)

plot(lake.xgb$evaluation_log, xlab='number of trials', 
     col="skyblue3", lty=3, type="p", main="RMSE of xgboost on the training data")

Now that we have a trained model, let us test its performance on the test data.

## Predictions
xgb.preds <- predict(lake.xgb, newdata = xTest)
xgb.perf <- performance(yTest, xgb.preds)
xgb.perf
##                 [,1]
## model.rmse 1.3799232
## model.cor  0.8896438
## model.mae  1.0032522
## model.r2   0.7911831
perform.table$xgb <- xgb.perf

The performance of xgboost was not significantly different from that of the other models. However, there is a possibility to improve the performance by further tuning the hyperparameters.

The following table summarizes all the models tested so far. As can be seen, all of them had very similar performance. Even if the performances are not excellent, I believe they could be considered acceptable.

perform.table %>% kable(digits = 3) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12)
             lm.all  lm.step  lm.step.cv  LASSO      lm.small   ann.perf   xgb
model.rmse   1.369   1.389    1.3947268   1.4310867  1.5559103  1.3320329  1.3799232
model.cor    0.892   0.888    0.8766364   0.8836414  0.8583315  0.8977825  0.8896438
model.mae    0.988   0.979    1.0333327   1.0736569  1.1863575  1.0034594  1.0032522
model.r2     0.794   0.788    0.7684066   0.7754114  0.7345242  0.8054256  0.7911831

Comparing model performance so far

To recap, we can visualize the performance of all the models trained and tested so far as shown below.

models.perf <- data.frame (t(perform.table))
models.perf$models <- rownames(models.perf)

perf.long <- melt(models.perf, id.vars = "models" )

theme_set(theme_bw())

ggplot(perf.long, aes(x=models, y=value)) + 
  geom_point(size=5) + 
  geom_segment(aes(x=models, xend=models, y=0, yend=value)) + 
  labs(title="Model Performance Metrics") + ylab("Metric value")+
  theme(axis.text.x = element_text(size=14, angle = 60, hjust = 1),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=18,face="bold", color = "darkred"))+
  facet_wrap(~ variable)

Conclusion for regression

I have trained and tested several well-known machine learning algorithms that have been used to solve complex data science problems in different sectors. The performance obtained by training these models can be considered acceptable and satisfactory, and these models can be tested and used for similar applications. In general, all the analyses done, including dimensionality reduction and feature selection, indicate that the problem is truly high dimensional and cannot easily be reduced to a lower dimension. The other performance limitation observed in the models, particularly in the linear models, can be attributed to the existence of outliers. I believe this can be improved by further including non-linearity and interactions between features in the models. One such example could be using generalized additive models (GAM). To improve the performance of the ANN and xgboost, further tuning of the hyperparameters can be tried.
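To illustrate the GAM idea suggested above, the gam() function from the mgcv package (a recommended package that ships with R) replaces a linear term with a smooth spline term. The data here are simulated with a deliberately non-linear relation, not the lakes data:

```r
library(mgcv)   # GAM implementation; ships with R as a recommended package

set.seed(7)
# Simulated data (not the lakes data) with a non-linear predictor-response relation
n <- 300
x <- runif(n, 0, 3)
y <- sin(2 * x) + x + rnorm(n, sd = 0.2)

# A GAM replaces the linear term with a smooth spline term s(x),
# capturing the non-linearity a plain lm() would miss
fit <- gam(y ~ s(x))
summary(fit)$r.sq    # adjusted R-squared of the smooth fit
```

On data like this, the smooth fit explains most of the variance that a straight-line fit would leave in the residuals.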



Lakes Clustering

In this section, I will apply different unsupervised classification techniques to identify different clusters among lakes. One known way of classifying lakes is based on the lake's trophic state, a measure of a lake's total biomass at a specific time and place. An indirect measurement is the trophic state index (TSI), which uses chlorophyll-a, total phosphorus, Secchi depth, and total nitrogen, based on Carlson (1977) [11]. We can calculate the TSI using the formula derived by Carlson, and we will compare the obtained TSI with clustering algorithms such as t-SNE and kmeans. Based on TSI, lakes are classified into four states, from least to most nutrient- and biomass-rich (i.e., from cleanest): oligotrophic, mesotrophic, eutrophic and hypereutrophic. For the TSI clustering I will use only the four features that are also used to calculate the TSI. Finally, I will explore the existence of other clusters among the lakes using the entire feature set.

Kmeans clustering

First, I will calculate the TSI based on the commonly used empirical formula by Carlson and identify the trophic state of each lake. The result will be used to compare the performance of the clustering algorithms I will be using.
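Carlson's index maps chlorophyll-a, Secchi depth and total phosphorus onto a common 0-100 scale. A minimal sketch of the calculation and the conventional class cut-offs (the helper names and the averaging of the three component indices are my assumptions, not necessarily how the values were computed here):

```r
# Carlson (1977) component indices; inputs: chlorophyll-a (ug/L),
# Secchi depth (m), total phosphorus (ug/L).
# Averaging the components into one TSI is an assumption of this sketch.
carlson_tsi <- function(chl, secchi, tp) {
  tsi_chl <- 9.81 * log(chl) + 30.6
  tsi_sd  <- 60 - 14.41 * log(secchi)
  tsi_tp  <- 14.42 * log(tp) + 4.15
  rowMeans(cbind(tsi_chl, tsi_sd, tsi_tp))
}

# Conventional cut-offs for the four trophic states
trophic_state <- function(tsi) {
  cut(tsi, breaks = c(-Inf, 40, 50, 70, Inf),
      labels = c("Oligotrophic", "Mesotrophic", "Eutrophic", "Hypereutrophic"))
}

# A clear lake vs a nutrient-rich lake (illustrative values)
tsi <- carlson_tsi(chl = c(1, 40), secchi = c(8, 0.5), tp = c(5, 90))
trophic_state(tsi)
```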

Now that we have the trophic state of the lakes, we can start working on the clustering. First, I will use kmeans to form the clusters and compare it with t-SNE.

# Scale the data
lakes_scaled <- data.frame(scale(lakes.df[,1:42]))
lakes_scaled$URBAN <- ifelse(lakes.df$URBAN =="YES",1,0)   # NB: the factor levels are "Yes"/"No", so "YES" never matches and this column is all 0
lakes_scaled$LAKE_ORIGIN <- ifelse(lakes.df$LAKE_ORIGIN =="NATURAL",1,0)   # likely the same case mismatch ("Natural")

# Run kmeans using the four features and four clusters
set.seed(3435)
lakes_clusters4<- kmeans(lakes_scaled[,c("NTL","PTL","SECCHI","Chlorophyl")], 4, nstart = 20, iter.max = 30)

Clustering using t-SNE

Next, I will do the clustering using t-SNE and compare both kmeans and t-SNE with the empirically calculated TSI results.

library(Rtsne)
set.seed(3434)
tsne_lakes <- Rtsne(lakes.df[,c("NTL","PTL","SECCHI","Chlorophyl")],
                    dims = 2, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.47 seconds (sparsity = 0.266508)!
## Learning embedding...
## Iteration 50: error is 54.431234 (50 iterations in 0.20 seconds)
## Iteration 100: error is 52.519277 (50 iterations in 0.18 seconds)
## Iteration 150: error is 52.512390 (50 iterations in 0.16 seconds)
## Iteration 200: error is 52.512236 (50 iterations in 0.16 seconds)
## Iteration 250: error is 52.514158 (50 iterations in 0.16 seconds)
## Iteration 300: error is 0.680049 (50 iterations in 0.19 seconds)
## Iteration 350: error is 0.584024 (50 iterations in 0.19 seconds)
## Iteration 400: error is 0.562367 (50 iterations in 0.20 seconds)
## Iteration 450: error is 0.555057 (50 iterations in 0.20 seconds)
## Iteration 500: error is 0.551683 (50 iterations in 0.20 seconds)
## Fitting performed in 1.83 seconds.

Let us plot the results of kmeans and t-SNE and overlay them with the calculated TSI values.

clrs2 <- c("darkblue", "deepskyblue3","chartreuse4","black")
trophic.col <- trophic
levels(trophic.col) <- clrs2
trophic.col <- as.character(trophic.col)

par(mfrow=c(1,2), omi=c(.1,.3,.4,.1))
plot(tsne_lakes$Y,  t="n", sub = "t-SNE vs calculated TSI", col.sub="brown" ) 
text(tsne_lakes$Y, labels = trophic.num, col=trophic.col) #, pch = lakes_clusters4$cluster)
legend("topleft", levels(trophic), text.col = clrs2, bg='gray90', cex=0.8, pch=levels(trophic.num))

plot(tsne_lakes$Y, pch=levels(factor(lakes_clusters4$cluster)), 
     col=clrs2[trophic.num], sub = "kmeans vs calculated TSI", col.sub="brown" )
legend("topleft", levels(trophic), text.col = clrs2, bg='gray90', cex=0.8, pch=levels(trophic.num))

mtext("t-SNE vs K-means Clusters for US lakes", side = 3, cex=1.5, outer = T)

# Check k-means accuracy
caret::confusionMatrix(factor(lakes_clusters4$cluster),trophic.num)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   2 173 214   0
##          2   0   0 241 119
##          3 151  52   0   0
##          4   0   0   1 177
## 
## Overall Statistics
##                                          
##                Accuracy : 0.1584         
##                  95% CI : (0.1376, 0.181)
##     No Information Rate : 0.4035         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : -0.0843        
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity          0.013072   0.0000   0.0000   0.5980
## Specificity          0.603889   0.6022   0.6988   0.9988
## Pos Pred Value       0.005141   0.0000   0.0000   0.9944
## Neg Pred Value       0.796221   0.7078   0.5081   0.8750
## Prevalence           0.135398   0.1991   0.4035   0.2619
## Detection Rate       0.001770   0.0000   0.0000   0.1566
## Detection Prevalence 0.344248   0.3186   0.1796   0.1575
## Balanced Accuracy    0.308481   0.3011   0.3494   0.7984

We see in the left figure that t-SNE agrees strongly with the calculated TSI. The figure on the right is a bit more complex: the locations of the observations are based on the t-SNE coordinates, the colors on the four trophic states as calculated by the TSI, and the labels (numbers) on the kmeans clustering. Thus, we observe that kmeans (as the confusion matrix also shows) agrees very poorly with both the TSI and t-SNE results.

set.seed(33)
tsne_lakes3d <- Rtsne(lakes.df[,c("NTL","PTL","SECCHI","Chlorophyl")],
                    dims = 3, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 4 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.49 seconds (sparsity = 0.266508)!
## Learning embedding...
## Iteration 50: error is 54.563281 (50 iterations in 0.45 seconds)
## Iteration 100: error is 52.482003 (50 iterations in 0.28 seconds)
## Iteration 150: error is 52.475115 (50 iterations in 0.25 seconds)
## Iteration 200: error is 52.474545 (50 iterations in 0.25 seconds)
## Iteration 250: error is 52.473286 (50 iterations in 0.25 seconds)
## Iteration 300: error is 0.621888 (50 iterations in 0.34 seconds)
## Iteration 350: error is 0.509386 (50 iterations in 0.38 seconds)
## Iteration 400: error is 0.483002 (50 iterations in 0.38 seconds)
## Iteration 450: error is 0.473704 (50 iterations in 0.40 seconds)
## Iteration 500: error is 0.466167 (50 iterations in 0.40 seconds)
## Fitting performed in 3.38 seconds.
tsne3d <- data.frame(tsne_lakes3d$Y, colr = trophic.col)

info <- left_join(select_lakes2[, c("UID","URBAN","LAKE_ORIGIN")], lakes_info12[,c("UID", "NARS_NAME","STATE")])

plot_ly(tsne3d, x=~X1, y=~X2, z=~X3, colors = trophic.col) %>%
  add_markers(text= ~paste('Lake name:', info$NARS_NAME, '<br> State:', info$STATE)) %>%
  layout(scene = list(xaxis = list(title = 'D1'),
                     yaxis = list(title = 'D2'),
                     zaxis = list(title = 'D3')))

As can be seen from the plot above, t-SNE was able to cluster the lakes in the same way as the empirically calculated TSI levels; kmeans, however, performed very poorly, with a prediction accuracy below 50%.

Discover clusters using all features

In this section, I will use the entire feature set and explore whether there is another clustering among the lakes. I will start with kmeans, and then use t-SNE to answer this question.

Kmeans clustering

library(cluster)
k=seq(2,100,4)
lakes_clusters=vector("list", length(k))
within.ss=vector("list", length(k))
between.ss=vector("list",length(k))
sil=vector("list",length(k))

set.seed(3456)
for (i in 1:length(k)) {
  lakes_clusters[[i]]<- kmeans(lakes_scaled, k[i], nstart = 20, iter.max = 40)
  within.ss[[i]] <- lakes_clusters[[i]]$tot.withinss
  between.ss[[i]] <- lakes_clusters[[i]]$betweenss
  
}

Plot the results of kmeans to see if we can identify clear clusters in the data.

plot(k, within.ss, xlab = "k values", ylim = c(1000,50000), ylab = "sum of squared errors", type="b", lwd=2, pch=18)
lines(k, between.ss, col="royalblue2", type="b", lwd=2, pch=20)
legend("bottomright",c("Total within clusters","Between clusters"), lwd=c(2,2), col=c("black","royalblue2"), pch=c(18,20))
title("Clusters Variance vs Cluster Numbers")

Let us plot the case of two clusters using the clusplot function from the cluster package to see if we get clear clusters.

clusplot(lakes.df, lakes_clusters[[1]]$cluster, color=TRUE, 
         shade=TRUE, labels=2, lines=0,
         main = "kmeans with k=2 clusters")

As can be seen from the above visualizations, kmeans does not yield clear and distinct clusters, though roughly 6 to 10 clusters could arguably be identified. Next, we will use t-SNE to see if we can identify a clear clustering of the lakes based on their overall characteristics.

t-SNE clustering

set.seed(11)
tsne_lakes2 <- Rtsne(lakes.df, dims = 2, perplexity=80, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 1130 x 45 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 80.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.51 seconds (sparsity = 0.267741)!
## Learning embedding...
## Iteration 50: error is 60.389727 (50 iterations in 0.21 seconds)
## Iteration 100: error is 56.506053 (50 iterations in 0.19 seconds)
## Iteration 150: error is 56.492306 (50 iterations in 0.20 seconds)
## Iteration 200: error is 56.463207 (50 iterations in 0.20 seconds)
## Iteration 250: error is 56.452528 (50 iterations in 0.20 seconds)
## Iteration 300: error is 0.820401 (50 iterations in 0.17 seconds)
## Iteration 350: error is 0.746273 (50 iterations in 0.17 seconds)
## Iteration 400: error is 0.727430 (50 iterations in 0.17 seconds)
## Iteration 450: error is 0.719992 (50 iterations in 0.17 seconds)
## Iteration 500: error is 0.715898 (50 iterations in 0.17 seconds)
## Fitting performed in 1.86 seconds.
plot(tsne_lakes2$Y, main="t-SNE Clusters for US lakes", col=lakes.df$URBAN)
legend("topright",c("Urban", "Rural"), col=c("black","red"), pch = c(1,1))

We can see that t-SNE has identified two groups of clusters, but they don't correspond to urban versus rural lakes. In addition, other than saying there are two groups, it is difficult to understand what these two clusters mean. I will do additional clustering using hierarchical clustering and compare the result with t-SNE in the next section.

Hierarchical Clustering

hc.complete = hclust (dist(lakes_scaled), method ="complete")

plot(hc.complete ,main =" Complete Linkage ", xlab ="", sub ="", cex =.6, labels=info$STATE)
rect.hclust(hc.complete, k = 4, border = 2:5)

# Cut tree into 4 groups
sub_grp <- cutree(hc.complete, k = 4)
table(sub_grp)
## sub_grp
##   1   2   3   4 
## 559 517  53   1

We notice here that cluster 4 contains only 1 lake and cluster 3 contains 53. The majority of the lakes are found in the first two clusters. From this we could say we have two major clusters.

We can also use the function fviz_cluster from the factoextra package to get a 2d visualization of the four clusters.

library(factoextra)
fviz_cluster(list(data = lakes_scaled[,-43], cluster = sub_grp)) # URBAN had 0 variance, remove it

Clusters 1 and 3 seem more distinct, while cluster 2 could be split between these two clusters. The one observation that falls into cluster 4 could be an anomaly, or could belong to cluster 3 if we judge from the low-dimensional view only.

What is the optimal number of hierarchical clusters?

We can also try to find the optimal number of hierarchical clusters. For this we can use the fviz_nbclust function from the factoextra package as shown below.

fviz_nbclust(lakes_scaled, FUN = hcut, method = "wss")

fviz_nbclust(lakes_scaled, FUN = hcut, method = "silhouette")

Based on the within-cluster variance measure it is not clear how many clusters are optimal, but based on the silhouette measure the optimal number of clusters is 2, which is also in agreement with the t-SNE result.
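The average silhouette width behind fviz_nbclust can also be computed directly with silhouette() from the cluster package. A small self-contained sketch on simulated two-group data (not the lakes data):

```r
library(cluster)   # for silhouette(); ships with R

set.seed(42)
# Two well-separated synthetic groups in 2-D (simulated, not the lakes data)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Average silhouette width of a k-means solution with k clusters
avg_sil <- function(k, data) {
  km <- kmeans(data, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(data))[, "sil_width"])
}

sil_widths <- sapply(2:5, avg_sil, data = x)
sil_widths   # the largest average width marks the preferred k
```

On data with two genuine groups, k = 2 should give the largest average silhouette width, which is the criterion the silhouette plot above is based on.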

Let us use 2 clusters and get their labels, which we will use to compare with the t-SNE results.

sub_2grp <- cutree(hc.complete, k = 2)
fviz_cluster(list(data = lakes_scaled[,-43], cluster = sub_2grp))

table(sub_2grp)
## sub_2grp
##    1    2 
## 1076   54

We can also use the Ward hierarchical clustering method as a comparison.

hc.wd = hclust (dist(lakes_scaled), method ="ward.D")
plot(hc.wd ,main ="ward.D method ", xlab ="", sub ="", cex =.6)
rect.hclust(hc.wd, k = 2, border = 2:5)

sub_2g.wd <- cutree(hc.wd, 2)
table(sub_2g.wd)
## sub_2g.wd
##   1   2 
## 742 388
fviz_cluster(list(data = lakes_scaled[,1:42], cluster = sub_2g.wd))

Comparing t-SNE and hierarchical clustering

We can also compare the results of t-SNE and hierarchical clustering with the aid of visualization as follows.

par(mfrow=c(1,2), oma=c(1,1,2,0.2))
plot(tsne_lakes2$Y, col=sub_2g.wd, sub="With ward.D method")
legend("topright",c("Cluster1", "Cluster2"), col=c("black","red"), pch = c(1,1), title = "Hierarchical clustering")

plot(tsne_lakes2$Y, col=sub_2grp, sub="With complete linkage method")
legend("topright",c("Cluster1", "Cluster2"), col=c("black","red"), pch = c(1,1), title = "Hierarchical clustering")

mtext("t-SNE vs Hierarchical Clustering of US Lakes for the Year 2012",side = 3, outer = T, cex = 1.5)

As can be seen from the above visualization, the two clustering methods do not yield similar results, similar to what was observed between t-SNE and kmeans.

Conclusion for clustering

I have clustered the lakes using several techniques, including kmeans, hierarchical clustering and t-SNE. Again, the results indicate the complexity of the problem. However, we have gained useful insight into how the lakes could be clustered; in particular, the t-SNE results indicate that two clear clusters of lakes exist. Further analysis can be done to identify what distinguishes these two clusters.


Acknowledgement

I would like to thank my wife, Sara, for her support and patience during this work. I would also like to acknowledge Prof. Ivo D. Dinov for his insightful teaching and inspiration throughout the entire course, including the project.


References


  1. USEPA. 2017. National Lakes Assessment 2012: Technical Report. EPA 841-R-16-114. U.S. Environmental Protection Agency, Washington, D.C.

  2. Amina I. Pollard, Stephanie E. Hampton, and Dina M. Leech. The Promise and Potential of Continental-Scale Limnology Using the U.S. Environmental Protection Agency's National Lakes Assessment. Association for the Sciences of Limnology and Oceanography. (2018)

  3. USEPA. 2017. National Lakes Assessment 2012: Technical Report. EPA 841-R-16-114. U.S. Environmental Protection Agency, Washington, D.C.

  4. Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustain. Water Resour. Manag. (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7

  5. Boqiang Qin. Lake eutrophication: Control countermeasures and recycling exploitation. Ecological Engineering. (2009), pages 1569-1573

  6. Jeremy Mack. Eutrophication. Retrieved from http://www.lakescientist.com/eutrophication/

  7. Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustain. Water Resour. Manag. (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7

  8. USEPA. Indicators: Chlorophyll a. Retrieved from https://www.epa.gov/

  9. Bhateria, R. & Jain, D. Water quality assessment of lake water: a review. Sustain. Water Resour. Manag. (2016) 2: 161. https://doi.org/10.1007/s40899-015-0014-7

  10. Biswajit Bhagowati, Kamal Uddin Ahamad. A review on lake eutrophication dynamics and recent developments in lake modeling. Ecological Engineering. (2018), pages 1569-1573

  11. Robert E. Carlson. A trophic state index for lakes. Limnology and Oceanography. (1977) 22(2): 361-369. https://doi.org/10.4319/lo.1977.22.2.0361